How are signatures computed?

Signatures are computed using a two-step filtering process that identifies the most discriminative features between two sample subsets (either A vs B or A vs Rest). This method combines distributional analysis with classification performance metrics to select features that are both statistically different and practically useful for distinguishing sample populations.

Step-by-Step Process

  • For every feature, compute the Wasserstein distance between the two distributions (one per subset)

  • Sort features by descending Wasserstein distance and keep the 100 best features

  • For each of these 100 features, compute the threshold (cut) that best distinguishes the two subsets according to the Matthews correlation coefficient (MCC)

  • Select the 20 features with the highest |MCC| scores and report them as the final signature

Input: Two sample subsets
       ↓
Step 1: Wasserstein Distance Filtering
       ↓
Rank all features by Wasserstein distance
       ↓
Keep top 100 features
       ↓
Step 2: MCC Refinement
       ↓
Compute MCC for each feature
       ↓
Select top 20 features by |MCC|
       ↓
Output: Signature with 20 discriminative features

Step 1: Wasserstein Distance Filtering

Identify features with the most different distributions between two groups by measuring the “distance” between their probability distributions. For two Gaussian distributions N(μ₁, σ₁²) and N(μ₂, σ₂²), the 2-Wasserstein distance is:

\[W_2 = \sqrt{(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2}\]

Implementation

import numpy as np


def wasserstein_distance(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """
    Compute the 2-Wasserstein distance between two univariate Gaussian distributions.

    This is the L2 distance between the distributions' (mean, std) parameters.
    """
    mean_diff = mu1 - mu2
    std_diff = sigma1 - sigma2
    return np.sqrt(mean_diff**2 + std_diff**2)

The algorithm first computes group statistics for each feature, calculating mean and standard deviation for Group A and Group B (either a specific subset or the remaining samples). It then calculates the Wasserstein distance between distributions for each feature, ranks all features in descending order by Wasserstein distance, and selects the top 100 features with the highest distributional differences.
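A minimal sketch of this step, assuming X_a and X_rest are (samples × features) matrices for Group A and the remaining samples; the variable names and synthetic data below are illustrative, not the library's API:

import numpy as np

# Synthetic stand-ins for the two groups (illustrative shapes only)
rng = np.random.default_rng(0)
X_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2000))      # Group A: 50 samples
X_rest = rng.normal(loc=0.3, scale=1.2, size=(200, 2000))  # remaining samples

# Per-feature group statistics
mu_a, sd_a = X_a.mean(axis=0), X_a.std(axis=0)
mu_b, sd_b = X_rest.mean(axis=0), X_rest.std(axis=0)

# Closed-form 2-Wasserstein distance per feature (Gaussian assumption)
w2 = np.sqrt((mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2)

# Rank by descending distance and keep the 100 most separated features
top_100 = np.argsort(w2)[::-1][:100]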

Why Wasserstein Distance?

Wasserstein distance captures both location and scale differences, being sensitive to changes in both mean and variance. It provides intuitive interpretation where higher values indicate more separated distributions, offers computational efficiency through closed-form solutions for Gaussian distributions, and demonstrates robustness by being less sensitive to outliers compared to KL divergence, chi-squared distance, or total variation distance.

Step 2: Matthews Correlation Coefficient (MCC) Refinement

From the 100 pre-filtered features, identify the 20 that can most reliably distinguish between groups using optimal binary classification thresholds.

The Matthews Correlation Coefficient is defined as:

\[MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\]

Where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives respectively. MCC ranges from -1 (perfect negative correlation) through 0 (random prediction) to +1 (perfect positive correlation).
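For concreteness, here is a direct computation of MCC from confusion-matrix counts (the helper name mcc is ours, not the library's):

import numpy as np

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts, with a guard for empty margins."""
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

print(mcc(tp=45, tn=40, fp=10, fn=5))  # ~0.70: strong but imperfect separation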

Rank-Based Approach

The algorithm converts all values to ranks for scale-invariant comparison, handling features with different expression scales without requiring normalization:

from scipy.stats import rankdata

# Convert all values to ranks for scale-invariant comparison;
# all_scores is the (features x samples) matrix holding both groups' values side by side
ranks = rankdata(all_scores, method="min", axis=1).astype(int)
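Because ranks are unchanged by any strictly monotone transformation of the raw values, a feature ranks identically whether it is stored on a linear or a log scale:

import numpy as np
from scipy.stats import rankdata

x = np.array([0.5, 3.0, 12.0, 7.0])
print(rankdata(x, method="min"))            # [1. 2. 4. 3.]
print(rankdata(np.log1p(x), method="min"))  # [1. 2. 4. 3.] (identical)
print(rankdata(10 * x + 2, method="min"))   # [1. 2. 4. 3.] (identical)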

For each feature, the algorithm tests multiple thresholds across the range of feature values, computes the confusion matrix for each threshold, calculates MCC at each threshold, and selects the threshold that maximizes |MCC|.
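A non-vectorized reference sketch of this per-feature search, treating "above the cut in Group A" as a true positive (for illustration only; the production code below works on ranks and handles all features at once):

import numpy as np

def best_mcc_for_feature(values_a: np.ndarray, values_b: np.ndarray) -> tuple[float, float]:
    """Scan all candidate cuts for one feature and return (best_mcc, best_threshold)."""
    best_mcc, best_thr = 0.0, float("nan")
    for thr in np.unique(np.concatenate([values_a, values_b])):
        tp = int(np.sum(values_a > thr))   # Group A samples above the cut
        fn = len(values_a) - tp
        fp = int(np.sum(values_b > thr))   # Group B samples above the cut
        tn = len(values_b) - fp
        denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
        score = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
        if abs(score) > abs(best_mcc):
            best_mcc, best_thr = score, thr
    return best_mcc, best_thr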

The computation is vectorized across all features and thresholds simultaneously:

def all_mcc(scores1: np.ndarray, scores2: np.ndarray) -> np.ndarray:
    """
    Compute optimal MCC scores for all features using vectorized operations.

    For each feature, finds the threshold that maximizes |MCC|.

    Abridged excerpt: all_scores is the per-feature concatenation of the two
    groups' values, l1 and l2 are the group sizes, ranks1/ranks2 are the
    per-group slices of ranks, rng is the array of candidate rank thresholds,
    and searchsorted2d / matthews_c are library helpers (row-wise searchsorted
    and element-wise MCC, respectively).
    """
    # Rank all values for scale invariance
    ranks = rankdata(all_scores, method="min", axis=1)

    # Vectorized computation of confusion matrices across all thresholds
    a = np.minimum(searchsorted2d(ranks1, rng)[:, 1:], l1)  # True Positives
    b = l1 - a                                              # False Positives
    c = np.minimum(searchsorted2d(ranks2, rng)[:, 1:], l2)  # False Negatives
    d = l2 - c                                              # True Negatives

    # Compute MCC for all thresholds
    results = matthews_c(a, b, c, d, l1, l2)

    # Select, for each feature, the threshold with maximum |MCC|
    idx = l1 + l2 - 2 - np.abs(results[:, ::-1]).argmax(axis=1)
    return results[np.arange(scores1.shape[0]), idx]

Why MCC?

MCC finds optimal thresholds by maximizing true correlation between predicted and actual classifications, making it more robust to imbalanced datasets and class distribution skews compared to accuracy, precision, or recall.
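A quick check of this point, using scikit-learn's matthews_corrcoef on a deliberately imbalanced toy example (5 positives, 95 negatives): a classifier that always predicts the majority class reaches 95% accuracy yet has zero discriminative power.

import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([1] * 5 + [0] * 95)     # 5 positives, 95 negatives
y_pred = np.zeros(100, dtype=int)         # always predict the majority class

print((y_true == y_pred).mean())          # 0.95: accuracy looks excellent
print(matthews_corrcoef(y_true, y_pred))  # 0.0: no better than random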

Output Interpretation

The algorithm returns the 20 most discriminative feature indices, quantitative MCC scores measuring discriminative power, and direction indicators showing whether features are upregulated (“+”) or downregulated (“-”) in Group A versus Group B.

# Example output interpretation
signature_features = ["Gene_A", "Gene_B", "Gene_C"]
mcc_scores = [0.85, 0.72, 0.68]
directions = ["+", "-", "+"]

# Interpretation:
# Gene_A: Highly upregulated in Group A (MCC = 0.85)
# Gene_B: Moderately downregulated in Group A (MCC = 0.72)
# Gene_C: Moderately upregulated in Group A (MCC = 0.68)

Algorithm Advantages

  • The two-stage filtering mitigates multiple-testing concerns by shrinking the search space from all features to 100 and then to 20 (selected_features = sorted_features[:100] followed by new_selected_features = new_selected_features[:20]).

  • The Wasserstein distance captures both mean and variance differences (mean_diff = mu1 - mu2, std_diff = sigma1 - sigma2), detecting distributional changes that simple fold-change analysis would miss.

  • The MCC step is vectorized throughout (ranks = rankdata(all_scores, method="min", axis=1)), processing all features simultaneously rather than looping over them.

  • The rank-based approach removes preprocessing requirements: rankdata() is inherently scale invariant, so no log-transformation or z-scoring is needed.

Finally, the algorithm produces interpretable feature signatures with directionality indicators ("+" or "-") and quantitative MCC scores formatted as "(MCC:{:.3f})", including feature names from ad.var_names[outputs] with their discriminative power, enabling us to identify both upregulated and downregulated markers for experimental follow-up.

Usage Examples

A vs Rest Comparison

# Compare subset A against all remaining samples
signature, mcc_dict, directions = compute_signature(
    adata=adata,
    means=global_means,
    stds=global_stds,
    obs_indices_A=subset_A_indices,
    obs_indices_B=None  # Compare against rest
)

A vs B Comparison

# Compare two specific subsets
signature, mcc_dict, directions = compute_signature(
    adata=adata,
    means=global_means,
    stds=global_stds,
    obs_indices_A=subset_A_indices,
    obs_indices_B=subset_B_indices
)
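The returned values can then be reported with the "(MCC:{:.3f})" formatting mentioned above. The snippet below assumes signature is an ordered list of feature names and that mcc_dict and directions map each name to its MCC score and "+"/"-" direction; check the compute_signature docstring for the exact return types.

# Report each signature feature with its direction and MCC score
for name in signature:
    print(f"{name} [{directions[name]}] (MCC:{mcc_dict[name]:.3f})")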

Wasserstein Distance Interpretation

High values (> 1.0) indicate strong distributional differences, medium values (0.3-1.0) suggest moderate differences, while low values (< 0.3) represent subtle differences between groups.
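As a quick sense check with the closed-form distance from Step 1: a 1.5-unit mean shift with a modest spread change lands in the "strong" range, while small shifts in both mean and spread stay "subtle" (the parameter values here are made up for illustration).

import numpy as np

strong = np.sqrt((6.5 - 5.0) ** 2 + (1.3 - 1.0) ** 2)  # ~1.53 -> strong difference
subtle = np.sqrt((5.2 - 5.0) ** 2 + (1.1 - 1.0) ** 2)  # ~0.22 -> subtle difference
print(strong, subtle)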

MCC Score Interpretation

Absolute MCC values above 0.8 indicate excellent discriminative power, values above 0.5 suggest good discriminative power, values above 0.3 represent moderate discriminative power, and values below 0.3 indicate poor discriminative ability.