Features Overview
OptiMHC uses a modular feature generation system. Each feature computes a set of scoring columns that are added to the PSM data before rescoring. By combining features from multiple orthogonal sources, the rescoring model can better distinguish true peptide identifications from false ones.
Feature Architecture
All features inherit from BaseFeatureGenerator and implement a common interface:
- `feature_columns`: the list of output column names.
- `id_column`: the key column(s) used to merge features back into the PsmContainer.
- `generate_features()`: computes and returns a DataFrame of features.
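To make the interface concrete, here is a minimal sketch of what a generator could look like. The abstract class below is a simplified stand-in written for illustration, not OptiMHC's actual `BaseFeatureGenerator`; the `PeptideLengthFeature` class, its constructor, and the `Peptide` and `length` column names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List

import pandas as pd


class FeatureGeneratorSketch(ABC):
    """Illustrative stand-in for the interface above (not the real base class)."""

    @property
    @abstractmethod
    def feature_columns(self) -> List[str]:
        """Names of the output columns."""

    @property
    @abstractmethod
    def id_column(self) -> List[str]:
        """Key column(s) used to merge the features back."""

    @abstractmethod
    def generate_features(self) -> pd.DataFrame:
        """Compute and return a DataFrame of features."""


class PeptideLengthFeature(FeatureGeneratorSketch):
    """Hypothetical generator that emits a single column: peptide length."""

    def __init__(self, peptides: List[str]):
        # Deduplicate while preserving order, so each peptide is scored once.
        self.peptides = list(dict.fromkeys(peptides))

    @property
    def feature_columns(self) -> List[str]:
        return ["length"]

    @property
    def id_column(self) -> List[str]:
        return ["Peptide"]

    def generate_features(self) -> pd.DataFrame:
        return pd.DataFrame(
            {"Peptide": self.peptides, "length": [len(p) for p in self.peptides]}
        )
```

Returning the key column alongside the feature columns is what lets the caller merge the result back without relying on row order.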
Features are configured in the YAML file as a list of {name, params} entries. The pipeline instantiates each feature, calls generate_features(), and merges the result into the PsmContainer.
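The configure, generate, and merge loop described above can be sketched as follows. This is an illustration under assumptions, not OptiMHC's pipeline code: the `REGISTRY` mapping, the `basic_length` function, the `min_length` parameter, and the `Peptide` key column are all hypothetical, and a plain DataFrame stands in for the PsmContainer.

```python
import pandas as pd

# Stand-in for the PSM table held by the PsmContainer (assumption: plain DataFrame).
psms = pd.DataFrame({
    "Peptide": ["PEPTIDE", "ACDK", "PEPTIDE"],
    "score": [3.1, 1.2, 2.7],
})


def basic_length(psms: pd.DataFrame, min_length: int = 0) -> pd.DataFrame:
    """Hypothetical feature: one row per unique peptide with its length."""
    unique = psms[["Peptide"]].drop_duplicates()
    unique = unique.assign(length=unique["Peptide"].str.len())
    return unique[unique["length"] >= min_length]


# Hypothetical name -> generator lookup; the real pipeline resolves names itself.
REGISTRY = {"BasicLength": basic_length}

# Mirrors the list of {name, params} entries from the YAML file.
feature_config = [{"name": "BasicLength", "params": {"min_length": 0}}]

for entry in feature_config:
    generator = REGISTRY[entry["name"]]
    features = generator(psms, **entry["params"])
    # A left merge keeps every PSM, even when a feature row is missing.
    psms = psms.merge(features, on="Peptide", how="left")
```

A left merge is used in this sketch so that PSMs without a computed feature remain in the table with missing values instead of being dropped.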
Feature Categories
| Category | Name | Columns | Description |
|---|---|---|---|
| Original Features | PepXML / PIN parser | Variable | Search engine scores and derived mass/ion features |
| Basic Features | Basic | 5 | Peptide sequence properties (length, entropy, AA composition) |
| Spectral Similarity | SpectralSimilarity | 8 | Similarity between experimental and predicted spectra |
| Retention Time Deviation | DeepLC | 5 | Deviation between observed and predicted retention time |
| Antigen Presentation Scores | NetMHCpan, NetMHCIIpan, MHCflurry | 3–4 per tool | MHC binding affinity and presentation predictions |
| PWM Score | PWM | 1–3 per allele | Position weight matrix binding score |
| Overlapping Peptide Score | OverlappingPeptide | 4 | Graph-based peptide overlap and contig assembly features |
Preprocessing
Most features share common preprocessing steps:
- Strip flanking amino acids: remove the preceding and following amino acid notation (e.g., `K.PEPTIDE.R` becomes `PEPTIDE`).
- Remove modifications: strip bracketed mass annotations (e.g., `PEPTM[147.035]IDE` becomes `PEPTMIDE`).
Selenocysteine (U) handling
Selenocysteine (U) is an extremely rare amino acid that most prediction tools do not support. During preprocessing, U is automatically replaced with cysteine (C) to ensure compatibility with external tools such as Koina, DeepLC, MHCflurry, NetMHCpan, and the PWM scoring matrices.
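The three preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not OptiMHC's implementation: the function names are hypothetical, and the regular expressions assume that flanking residues are single characters separated by dots and that modifications are written in square brackets, as in the examples.

```python
import re

# Assumption: flanking residues appear as "X.SEQUENCE.Y", where X and Y are
# single characters (an amino acid letter, or "-" at a protein terminus).
_FLANKING = re.compile(r"^[A-Za-z-]\.(.+)\.[A-Za-z-]$")
# Assumption: modifications are written as square-bracketed annotations.
_MODIFICATION = re.compile(r"\[[^\]]*\]")


def strip_flanking(peptide: str) -> str:
    """Drop the preceding and following residue notation, if present."""
    match = _FLANKING.match(peptide)
    return match.group(1) if match else peptide


def remove_modifications(peptide: str) -> str:
    """Remove bracketed mass annotations such as [147.035]."""
    return _MODIFICATION.sub("", peptide)


def replace_selenocysteine(peptide: str) -> str:
    """Map selenocysteine (U) to cysteine (C) for tool compatibility."""
    return peptide.replace("U", "C")


def preprocess(peptide: str) -> str:
    """Apply the three steps in order: flanking, modifications, U -> C."""
    return replace_selenocysteine(remove_modifications(strip_flanking(peptide)))
```

For example, `preprocess("K.PEPTM[147.035]IDE.R")` yields `PEPTMIDE`. Flanking residues are stripped first in this sketch because the dots inside a mass annotation would otherwise complicate the flanking pattern.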
Missing Value Handling
When a feature cannot be computed for a PSM (e.g., peptide length outside the supported range for a binding predictor), the missing values are filled with the median of the successfully computed values for that column. This ensures no NaN values propagate to the rescoring model.
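Median imputation of this kind can be sketched in a few lines of pandas. The `fill_with_median` helper and the `binding_score` column are hypothetical names chosen for illustration; the actual code in OptiMHC may differ.

```python
import pandas as pd

# Hypothetical feature table: the second peptide could not be scored.
features = pd.DataFrame({
    "Peptide": ["PEPTIDEA", "ACDEFGHIKLMNPQRSTVWY", "PEPTIDEK", "LLDVTAAV"],
    "binding_score": [0.2, float("nan"), 0.8, 0.5],
})


def fill_with_median(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy in which NaNs in each listed column are replaced by
    the median of that column's non-missing values."""
    filled = df.copy()
    for col in columns:
        # Series.median() skips NaN values by default.
        filled[col] = filled[col].fillna(filled[col].median())
    return filled


filled = fill_with_median(features, ["binding_score"])
```

If every value in a column is missing, the median is itself NaN and nothing is filled, so a real implementation would need a fallback for that case; the text above does not describe one.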