Features Overview

OptiMHC uses a modular feature generation system. Each feature computes a set of scoring columns that are added to the PSM data before rescoring. By combining features from multiple orthogonal sources, the rescoring model can better distinguish true peptide identifications from false ones.

Feature Architecture

All features inherit from BaseFeatureGenerator and implement a common interface:

feature_columns — the list of output column names.
id_column — the key column(s) used to merge features back into the PsmContainer.
generate_features() — computes and returns a DataFrame of features.

Features are configured in the YAML file as a list of {name, params} entries. The pipeline instantiates each feature, calls generate_features(), and merges the result into the PsmContainer.
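The instantiate, generate, merge flow described above can be sketched roughly as follows. This is a minimal illustration, not OptiMHC's actual code: `FeatureGeneratorSketch`, `PeptideLength`, `REGISTRY`, and `run_pipeline` are hypothetical names, the constructor signature is an assumption, and plain lists of dicts stand in for the DataFrame and the PsmContainer so the example has no dependencies.

```python
from abc import ABC, abstractmethod


class FeatureGeneratorSketch(ABC):
    """Hypothetical stand-in for BaseFeatureGenerator (not the real class)."""

    @property
    @abstractmethod
    def feature_columns(self):
        """Names of the output columns."""

    @property
    @abstractmethod
    def id_column(self):
        """Key column(s) used to merge features back into the PSM rows."""

    @abstractmethod
    def generate_features(self):
        """Return one dict per key, holding the id and feature columns."""


class PeptideLength(FeatureGeneratorSketch):
    """Toy feature that scores each unique peptide by its length."""

    def __init__(self, peptides, column="length"):
        self.peptides = peptides
        self.column = column

    @property
    def feature_columns(self):
        return [self.column]

    @property
    def id_column(self):
        return ["Peptide"]

    def generate_features(self):
        unique = sorted(set(self.peptides))
        return [{"Peptide": p, self.column: len(p)} for p in unique]


# Hypothetical mapping from the configured name to a feature class.
REGISTRY = {"PeptideLength": PeptideLength}


def run_pipeline(psms, config):
    """Instantiate each configured feature and merge its output into the PSM rows."""
    peptides = [row["Peptide"] for row in psms]
    for entry in config:
        feature = REGISTRY[entry["name"]](peptides, **entry.get("params", {}))
        keys = feature.id_column
        lookup = {tuple(r[k] for k in keys): r for r in feature.generate_features()}
        for row in psms:
            match = lookup.get(tuple(row[k] for k in keys), {})
            for col in feature.feature_columns:
                row[col] = match.get(col)  # None when no feature row matched
    return psms


# Mirrors the {name, params} entries of the YAML file, written as Python literals.
config = [{"name": "PeptideLength", "params": {"column": "pep_len"}}]
psms = [{"Peptide": "PEPTIDE", "score": 3.2}, {"Peptide": "SIINFEKL", "score": 4.1}]
result = run_pipeline(psms, config)
```

Keying the merge on `id_column` lets a feature compute each unique peptide once and still annotate every PSM that shares that key.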

Feature Categories

Category	Name	Columns	Description
Original Features	PepXML / PIN parser	Variable	Search engine scores and derived mass/ion features
Basic Features	`Basic`	5	Peptide sequence properties (length, entropy, AA composition)
Spectral Similarity	`SpectralSimilarity`	8	Similarity between experimental and predicted spectra
Retention Time Deviation	`DeepLC`	5	Deviation between observed and predicted retention time
Antigen Presentation Scores	`NetMHCpan`, `NetMHCIIpan`, `MHCflurry`	3–4 per tool	MHC binding affinity and presentation predictions
PWM Score	`PWM`	1–3 per allele	Position weight matrix binding score
Overlapping Peptide Score	`OverlappingPeptide`	4	Graph-based peptide overlap and contig assembly features

Preprocessing

Most features share common preprocessing steps:

Strip flanking amino acids: remove the notation for the preceding and following residues (e.g., K.PEPTIDE.R becomes PEPTIDE).
Remove modifications: strip bracketed mass annotations (e.g., PEPTM[147.035]IDE becomes PEPTMIDE).

Selenocysteine (U) handling

Selenocysteine (U) is an extremely rare amino acid that most prediction tools do not support. During preprocessing, U is automatically replaced with cysteine (C) to ensure compatibility with external tools such as Koina, DeepLC, MHCflurry, NetMHCpan, and the PWM scoring matrices.
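The preprocessing steps in this section can be approximated with a few string operations. The sketch below is illustrative only: the function names are hypothetical, and the regular expressions assume single-residue flanks separated by dots (with "-" allowed for a terminus) and modifications written in square brackets. The actual implementation may handle more notations.

```python
import re

# Assumed notation: one flanking residue (or "-") on each side, dot-separated.
FLANKS = re.compile(r"^[A-Za-z-]\.(.+)\.[A-Za-z-]$")
# Assumed notation: modification masses enclosed in square brackets.
MODIFICATION = re.compile(r"\[[^\]]*\]")


def strip_flanking(peptide):
    """Return the peptide without flanking residues, e.g. K.PEPTIDE.R -> PEPTIDE."""
    match = FLANKS.match(peptide)
    return match.group(1) if match else peptide


def remove_modifications(peptide):
    """Drop bracketed annotations, e.g. PEPTM[147.035]IDE -> PEPTMIDE."""
    return MODIFICATION.sub("", peptide)


def replace_selenocysteine(peptide):
    """Substitute cysteine (C) for selenocysteine (U)."""
    return peptide.replace("U", "C")


def preprocess(peptide):
    """Apply the three steps in order: flanks, modifications, then U to C."""
    return replace_selenocysteine(remove_modifications(strip_flanking(peptide)))


cleaned = [preprocess(p) for p in ["K.PEPTIDE.R", "PEPTM[147.035]IDE", "R.SEU.K"]]
```

Flanks are stripped before modifications here so that a dot inside a mass annotation such as `[147.035]` is never mistaken for a flank separator on an unflanked peptide.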

Missing Value Handling

When a feature cannot be computed for a PSM (e.g., peptide length outside the supported range for a binding predictor), the missing values are filled with the median of the successfully computed values for that column. This ensures no NaN values propagate to the rescoring model.
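A minimal sketch of this median fill is shown below, using `None` to mark a value that could not be computed and a list of dicts in place of the feature table. The function name, the `binding_score` column, and the example values are made up for illustration, and the behavior when a column has no computed values at all is an assumption.

```python
from statistics import median


def fill_missing_with_median(rows, column):
    """Replace None in `column` with the median of the computed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        # Assumption: leave the column untouched if nothing could be computed.
        return rows
    fill_value = median(observed)
    for row in rows:
        if row[column] is None:
            row[column] = fill_value
    return rows


psms = [
    {"Peptide": "SIINFEKL", "binding_score": 0.2},
    {"Peptide": "PEPTIDE", "binding_score": 0.8},
    # Hypothetical case: the peptide is too long for the binding predictor.
    {"Peptide": "AAAAAAAAAAAAAAAAAAAA", "binding_score": None},
    {"Peptide": "GILGFVFTL", "binding_score": 0.5},
]
fill_missing_with_median(psms, "binding_score")
```

With a pandas DataFrame the same idea is commonly written as `df[col] = df[col].fillna(df[col].median())`.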