PWM Score¶
The PWM feature (name: PWM) scores peptides against Position Weight Matrices derived from known MHC ligand data. PWM scoring is a classical, fast, and interpretable method for estimating MHC binding potential.
Source name: PWM
The PWM matrices used in OptiMHC are derived from the SysteMHC Atlas, a public data repository of immunopeptidomics datasets generated by mass spectrometry:
Shao, W.; Pedrioli, P.G.A. et al. The SysteMHC Atlas project. Nucleic Acids Res (2018). doi:10.1093/nar/gkx664
Huang, X.; Gan, Z. et al. The SysteMHC Atlas v2.0, an updated resource for mass spectrometry-based immunopeptidomics. Nucleic Acids Res (2024). doi:10.1093/nar/gkad1068
Output Columns¶
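The number and names of output columns depend on the MHC class and the configured alleles. Each column listed below is generated once per allele, with `{allele}` replaced by the allele name.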
MHC Class I¶
| Column | Description |
|---|---|
| `PWM_Score_{allele}` | Total PWM score for the peptide |
| `Anchor_Score_{allele}` | PWM score at the most conserved (anchor) positions only |
MHC Class II¶
| Column | Description |
|---|---|
| `PWM_Score_{allele}` | Best 9-mer core binding score |
| `N_Flank_PWM_Score_{allele}` | Score for the N-terminal flanking residues |
| `C_Flank_PWM_Score_{allele}` | Score for the C-terminal flanking residues |
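As a quick illustration of how the column names expand per allele, the sketch below formats the templates from the tables above. The helper `expected_columns` and the allele strings are hypothetical and only for illustration; how OptiMHC normalizes allele names inside column names is not specified here.

```python
# Column templates copied from the tables above; "{allele}" is the placeholder.
COLUMN_TEMPLATES = {
    "I": ["PWM_Score_{allele}", "Anchor_Score_{allele}"],
    "II": [
        "PWM_Score_{allele}",
        "N_Flank_PWM_Score_{allele}",
        "C_Flank_PWM_Score_{allele}",
    ],
}


def expected_columns(mhc_class, alleles):
    """Hypothetical helper: list the output columns for a class and allele set."""
    templates = COLUMN_TEMPLATES[mhc_class]
    return [t.format(allele=allele) for allele in alleles for t in templates]


# Example allele names are assumptions, not taken from the OptiMHC configuration.
for column in expected_columns("I", ["HLA-A*02:01", "HLA-B*07:02"]):
    print(column)
```

With two Class I alleles this yields four columns; the same two alleles under Class II would yield six.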
Computation¶
MHC Class I Scoring¶
For Class I, a length-specific PWM is loaded for each allele and peptide length. Each entry \( W_{a,j} \) represents the log-odds weight for amino acid \( a \) at position \( j \).
Total PWM score. The PWM score \( S \) for a peptide of length \( L \) with amino acid \( a_j \) at position \( j \) is the sum of the matching weights:
\[ S = \sum_{j=1}^{L} W_{a_j,\,j} \]
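The total score is a per-position lookup followed by a sum. A minimal sketch, assuming the PWM is held as a list of per-position dictionaries (this layout, the function name `pwm_score`, and the toy weights are illustrative assumptions, not OptiMHC's internal representation):

```python
# Toy PWM for 3-residue peptides over a reduced alphabet (made-up weights).
# toy_pwm[j][a] is the log-odds weight of amino acid a at 0-based position j.
toy_pwm = [
    {"A": 1.0, "L": -0.5, "K": 0.0},
    {"A": -1.0, "L": 2.0, "K": 0.5},
    {"A": 0.0, "L": 0.5, "K": 1.5},
]


def pwm_score(peptide, pwm):
    """Sum the weight of each residue at its own position."""
    if len(peptide) != len(pwm):
        raise ValueError("peptide length must match the PWM length")
    return sum(pwm[j][aa] for j, aa in enumerate(peptide))


print(pwm_score("ALK", toy_pwm))  # 1.0 + 2.0 + 1.5 = 4.5
```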
Anchor score. The anchor positions are identified as the \( n \) most conserved positions in the PWM (default \( n = 2 \)). Conservation is measured by the positional Shannon entropy \( H_j \):
\[ H_j = -\sum_{a} p_{a,j} \log p_{a,j} \]
where \( p_{a,j} \) is the probability of amino acid \( a \) at position \( j \), derived from the PWM weights.
The \( n \) positions with the lowest \( H_j \) (highest conservation) are selected as anchor positions. The anchor score \( S_\mathrm{anc} \) is:
\[ S_\mathrm{anc} = \sum_{j \in \mathcal{J}_\mathrm{anc}} W_{a_j,\,j} \]
where \( \mathcal{J}_\mathrm{anc} \) is the set of anchor positions.
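The anchor selection can be sketched as follows. The conversion from weights to probabilities is an assumption made for this example (weights treated as base-2 log-odds and renormalized per position); the actual conversion used by OptiMHC may differ. Function names and the toy matrix are likewise hypothetical.

```python
import math


def position_entropy(weights, base=2.0):
    """Shannon entropy (in bits) of one PWM position.

    ASSUMPTION: p_a = base**w_a / sum(base**w_a') converts weights to
    probabilities. This is an illustrative choice, not a documented one.
    """
    raw = [base ** w for w in weights.values()]
    total = sum(raw)
    probs = [r / total for r in raw]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def anchor_positions(pwm, n=2):
    """Return the n lowest-entropy (most conserved) 0-based positions."""
    entropies = [position_entropy(column) for column in pwm]
    ranked = sorted(range(len(pwm)), key=lambda j: entropies[j])
    return sorted(ranked[:n])


def anchor_score(peptide, pwm, n=2):
    """Sum the PWM weights at the anchor positions only."""
    return sum(pwm[j][peptide[j]] for j in anchor_positions(pwm, n))


# Toy 4-position PWM (made-up weights): positions 1 and 3 are strongly biased.
toy_pwm = [
    {"A": 0.0, "L": 0.0, "K": 0.0},
    {"A": -3.0, "L": 4.0, "K": -3.0},
    {"A": 0.5, "L": 0.0, "K": -0.5},
    {"A": -2.0, "L": -2.0, "K": 3.0},
]

print(anchor_positions(toy_pwm))      # [1, 3]
print(anchor_score("ALAK", toy_pwm))  # 4.0 + 3.0 = 7.0
```

Because only the ranking of entropies matters, the choice of logarithm base does not change which positions are selected.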
MHC Class II Scoring¶
MHC Class II molecules bind peptides through a 9-mer binding core embedded within a longer peptide. The PWM is always 9 positions long.
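Core binding score. A sliding window scans all possible 9-mer frames, and the frame with the highest score is selected. Let \( s \) be the starting position of a frame (1-based) in a peptide of length \( L \):
\[ S_\mathrm{core} = \max_{1 \le s \le L-8} \sum_{j=1}^{9} W_{a_{s+j-1},\,j} \]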
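The sliding-window search can be sketched as below. The function `best_core`, the dictionary-based PWM layout, and the toy weights are assumptions for illustration; the sketch uses 0-based start positions and keeps the first frame on ties, which may not match OptiMHC's tie-breaking.

```python
CORE_LEN = 9
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def best_core(peptide, pwm):
    """Return (best_score, start) over all 9-mer frames; start is 0-based."""
    if len(peptide) < CORE_LEN:
        raise ValueError("peptide is shorter than the 9-mer core")
    best_score, best_start = float("-inf"), -1
    for s in range(len(peptide) - CORE_LEN + 1):
        score = sum(pwm[j][peptide[s + j]] for j in range(CORE_LEN))
        if score > best_score:  # strict '>' keeps the earliest frame on ties
            best_score, best_start = score, s
    return best_score, best_start


# Toy 9-position PWM (made-up weights): rewards L at core position 1
# and K at core position 9; every other weight is 0.
toy_pwm = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(CORE_LEN)]
toy_pwm[0]["L"] = 2.0
toy_pwm[8]["K"] = 1.5

peptide = "GG" + "L" + "A" * 7 + "K" + "GG"  # 13 residues, 5 candidate frames
print(best_core(peptide, toy_pwm))  # (3.5, 2)
```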
Flanking scores. Once the best core position is determined, the flanking residues are scored:
- N-terminal flank: up to 3 residues immediately preceding the binding core. If fewer than 3 residues are available, the sequence is padded with `X` (unknown amino acid).
- C-terminal flank: up to 3 residues immediately following the binding core, similarly padded.
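Each flank is scored against its own 3-position PWM. Let \( W^{(N)} \) and \( W^{(C)} \) denote the N-flank and C-flank matrices, and let \( n_k \) and \( c_k \) be the residues at position \( k \) of the padded N- and C-terminal flanks. The flanking scores \( S_N \) and \( S_C \) are:
\[ S_N = \sum_{k=1}^{3} W^{(N)}_{n_k,\,k}, \qquad S_C = \sum_{k=1}^{3} W^{(C)}_{c_k,\,k} \]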
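A minimal sketch of flank extraction and scoring, given a 0-based core start. Two points are assumptions made for this example: padding is added on the outer side of each flank (left for the N-flank, right for the C-flank), and `X` contributes a weight of 0.0 when the matrix has no entry for it. The function names and toy matrix are hypothetical.

```python
def extract_flanks(peptide, core_start, core_len=9, flank_len=3, pad="X"):
    """Return (n_flank, c_flank), each padded with `pad` to flank_len."""
    core_end = core_start + core_len
    n_flank = peptide[max(0, core_start - flank_len):core_start]
    c_flank = peptide[core_end:core_end + flank_len]
    # ASSUMPTION: pad on the side facing away from the core.
    return n_flank.rjust(flank_len, pad), c_flank.ljust(flank_len, pad)


def flank_score(flank, flank_pwm):
    """Sum per-position weights; residues missing from the matrix score 0.0."""
    return sum(flank_pwm[k].get(aa, 0.0) for k, aa in enumerate(flank))


peptide = "GG" + "L" + "A" * 7 + "K" + "GG"
n_flank, c_flank = extract_flanks(peptide, core_start=2)
print(n_flank, c_flank)  # XGG GGX

# Toy 3-position N-flank matrix (made-up weights).
toy_n_pwm = [{"G": 0.5}, {"G": 1.0}, {"G": -0.25}]
print(flank_score(n_flank, toy_n_pwm))  # 0.0 + 1.0 - 0.25 = 0.75
```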
Preprocessing¶
- Flanking amino acids are stripped and modifications are removed.
- Peptides for which no matching PWM is available (e.g., unsupported length for Class I) receive median-imputed values.
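The two preprocessing steps can be sketched as follows. The input notation handled here (search-engine style `K.PEPTIDE.R` flanks, modifications in square brackets or parentheses) is an assumption, as is imputing with the median of the scores observed in the same column; `clean_peptide` and `impute_median` are hypothetical helper names.

```python
import re
from statistics import median


def clean_peptide(raw):
    """Strip flanking residues and modification annotations (assumed notation)."""
    # Drop "K." / ".R"-style flanking residues if both are present.
    match = re.fullmatch(r"[A-Z\-_]\.(.+)\.[A-Z\-_]", raw)
    seq = match.group(1) if match else raw
    # Drop bracketed or parenthesised modification annotations.
    seq = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", seq)
    # Keep only uppercase amino acid letters.
    return re.sub(r"[^A-Z]", "", seq)


def impute_median(scores):
    """Replace None (no matching PWM) with the median of the observed scores."""
    observed = [s for s in scores if s is not None]
    if not observed:
        raise ValueError("no observed scores to impute from")
    fill = median(observed)
    return [fill if s is None else s for s in scores]


print(clean_peptide("K.AM[+15.995]LDEFK.R"))  # AMLDEFK
print(impute_median([1.0, None, 3.0, 8.0]))   # [1.0, 3.0, 3.0, 8.0]
```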
Configuration¶
featureGenerator:
  - name: PWM
    params:
      class: I      # "I" or "II"
      anchors: 2    # Number of anchor positions (Class I only)
| Parameter | Default | Description |
|---|---|---|
| `class` | (required) | MHC class: `"I"` or `"II"` |
| `anchors` | `2` | Number of anchor positions for anchor score computation (Class I only) |
The alleles are taken from the top-level allele configuration.