Original Features
Original features are the initial set of scores and derived quantities extracted during input parsing. They form the baseline feature set (source name: "Original") that every rescoring run includes. The specific features depend on the input format.
PepXML Features
When parsing PepXML files, OptiMHC extracts raw search engine scores and computes several derived features.
Search Engine Scores
All <search_score> elements from the PepXML file are extracted as features. The exact columns depend on the search engine that produced the file. For example, Comet produces:
- xcorr: cross-correlation score
- deltacn: delta correlation (difference to next-best match)
- deltacnstar: delta CN star
- spscore: preliminary score
- sprank: SP rank
- expect: expectation value (E-value)
Other search engines (X!Tandem, MSFragger, etc.) produce their own score columns, all of which are automatically included.
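As a rough illustration of the extraction described above, the following sketch collects every <search_score> child of a search hit into a dictionary. It is not OptiMHC's parser: the XML fragment and the function name `extract_search_scores` are hypothetical, and the fragment omits the XML namespace that real PepXML files typically declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free PepXML fragment used only for illustration.
SNIPPET = """
<search_hit peptide="SLYNTVATL" protein="sp|P1">
  <search_score name="xcorr" value="2.31"/>
  <search_score name="deltacn" value="0.42"/>
  <search_score name="expect" value="1.2E-05"/>
</search_hit>
"""


def extract_search_scores(search_hit):
    """Return {score name: raw value string} for every <search_score> child."""
    return {
        element.get("name"): element.get("value")
        for element in search_hit.iter("search_score")
    }


hit = ET.fromstring(SNIPPET.strip())
scores = extract_search_scores(hit)
print(scores)
```

Values are kept as strings here so that a later step can still tell whether a score was written in scientific notation.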
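Mass Difference Features
Let \( m_\mathrm{exp} \) and \( m_\mathrm{calc} \) denote the experimental and calculated precursor neutral masses, respectively. The mass difference \( \Delta m \) is:
\[ \Delta m = m_\mathrm{exp} - m_\mathrm{calc} \]
The experimental and calculated m/z values are derived from the charge state \( z \) using the proton mass \( m_\mathrm{H} = 1.00727646677 \) Da:
\[ (m/z)_\mathrm{exp} = \frac{m_\mathrm{exp}}{z} + m_\mathrm{H}, \qquad (m/z)_\mathrm{calc} = \frac{m_\mathrm{calc}}{z} + m_\mathrm{H} \]
The absolute m/z difference \( \Delta_{mz} \) is:
\[ \Delta_{mz} = \left| (m/z)_\mathrm{exp} - (m/z)_\mathrm{calc} \right| \]
Because the proton mass cancels, this is equal to \( |\Delta m| / z \).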
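The quantities defined in this section can be sketched in a few lines. This is a minimal example, not OptiMHC's code: the function names are hypothetical, and the sign convention (experimental minus calculated) is an assumption.

```python
PROTON_MASS = 1.00727646677  # Da, the value quoted in the text


def mass_to_mz(neutral_mass, charge):
    """Convert a neutral mass to m/z for the given charge state."""
    return neutral_mass / charge + PROTON_MASS


def mass_features(exp_mass, calc_mass, charge):
    """Return (mass difference, absolute m/z difference) for one PSM."""
    # Assumed sign convention: experimental minus calculated.
    mass_diff = exp_mass - calc_mass
    abs_mz_diff = abs(
        mass_to_mz(exp_mass, charge) - mass_to_mz(calc_mass, charge)
    )
    return mass_diff, abs_mz_diff


mass_diff, abs_mz_diff = mass_features(1000.52, 1000.50, charge=2)
print(mass_diff, abs_mz_diff)
```

The proton mass cancels in the subtraction, so the absolute m/z difference is the absolute mass difference divided by the charge.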
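Matched Ion Ratio
When the search engine reports both the number of matched ions \( n_\mathrm{matched} \) and the total number of ions \( n_\mathrm{total} \), the matched ion ratio \( R_\mathrm{ion} \) is computed:
\[ R_\mathrm{ion} = \frac{n_\mathrm{matched}}{n_\mathrm{total}} \]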
This feature is only included if both values are present and \( n_\mathrm{total} > 0 \) for all PSMs.
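The all-or-nothing condition can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, and the dictionary keys `num_matched_ions` and `tot_num_ions` are chosen for the example rather than taken from OptiMHC.

```python
def matched_ion_ratio(psms):
    """Return one ratio per PSM, or None if the feature must be skipped.

    Each PSM is a dict; the key names are assumptions for this sketch.
    """
    ratios = []
    for psm in psms:
        matched = psm.get("num_matched_ions")
        total = psm.get("tot_num_ions")
        # A single missing value or non-positive total disables the feature.
        if matched is None or total is None or total <= 0:
            return None
        ratios.append(matched / total)
    return ratios


complete = [
    {"num_matched_ions": 6, "tot_num_ions": 16},
    {"num_matched_ions": 4, "tot_num_ions": 8},
]
incomplete = complete + [{"num_matched_ions": 3, "tot_num_ions": 0}]

print(matched_ion_ratio(complete))
print(matched_ion_ratio(incomplete))
```

Returning None for the whole column, rather than per PSM, mirrors the rule that the feature is dropped unless it is defined for every PSM.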
Charge One-Hot Encoding
The precursor charge state is one-hot encoded into binary columns:
charge_1, charge_2, charge_3, ...
For a PSM with charge 2, charge_2 = 1 and all other charge columns are 0. This allows the rescoring model to learn charge-specific effects without treating charge as a continuous variable.
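A minimal sketch of this encoding is shown below. The function name is hypothetical, and the choice to create columns only for charge states that actually occur in the data is an assumption of the example.

```python
def one_hot_charges(charges):
    """One-hot encode a list of integer precursor charges.

    Assumption: a column is created only for each observed charge state.
    """
    levels = sorted(set(charges))
    return [
        {f"charge_{level}": int(level == charge) for level in levels}
        for charge in charges
    ]


encoded = one_hot_charges([2, 3, 2, 1])
for row in encoded:
    print(row)
```

Each row has exactly one column set to 1, so the model receives charge as a set of indicator variables rather than as a single ordered number.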
Log Transformation of P-values and E-values
Many search engine scores span several orders of magnitude (e.g., E-values from \( 10^{-20} \) to \( 10^{2} \)). Following the approach used in mokapot, OptiMHC applies automatic log-transformation to compress these ranges into a scale more suitable for machine learning models.
Scientific notation values: If values contain scientific notation and span 4 or more orders of magnitude, the log transform for a value \( x = a \times 10^{b} \) is:
\[ \log_{10}(x) = \log_{10}(a) + b \]
For zero values, the log is set to one less than the minimum observed log value.
Numeric columns: Non-negative, non-binary numeric columns with \( x_\mathrm{max} / x_\mathrm{min}^{+} \geq 10{,}000 \) (where \( x_\mathrm{min}^{+} \) is the smallest nonzero value) are log-transformed as well.
The num_matched_peptides column is always log-transformed as \( \log_{10}(x) \).
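The scientific-notation rule can be sketched as follows. This is an illustration rather than the OptiMHC or mokapot implementation: the function name and the `min_span` parameter are hypothetical, the span is measured here on the log-transformed values, and columns containing negative values are returned unchanged as a simplifying assumption.

```python
import math


def log_transform_sci(values, min_span=4):
    """Log-transform a column of numeric strings such as '1.2e-05'.

    Returns plain floats when the column is not in scientific notation,
    contains negatives, or spans fewer than `min_span` orders of magnitude.
    """
    lowered = [text.strip().lower() for text in values]
    floats = [float(text) for text in lowered]
    if not any("e" in text for text in lowered) or min(floats) < 0:
        return floats

    logs = []
    for text in lowered:
        mantissa, _, exponent = text.partition("e")
        a = float(mantissa)
        b = int(exponent) if exponent else 0
        # log10(a * 10**b) = log10(a) + b; zeros are filled in afterwards.
        logs.append(None if a == 0 else math.log10(a) + b)

    observed = [value for value in logs if value is not None]
    if not observed or max(observed) - min(observed) < min_span:
        return floats

    floor = min(observed) - 1  # zeros get one less than the smallest log
    return [floor if value is None else value for value in logs]


wide = log_transform_sci(["1e-20", "1.0e2", "0"])
narrow = log_transform_sci(["1e-2", "2e-1"])
print(wide, narrow)
```

Working on the mantissa and exponent separately avoids taking the logarithm of values that underflow to zero when parsed as floats.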
Decoy Detection
A PSM is initially labeled as a decoy if its first protein accession starts with the decoyPrefix (default: DECOY_). If any alternative protein accession does not start with the prefix, the PSM is re-labeled as a target. A PSM therefore remains a decoy only when every protein accession it maps to carries the prefix.
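The two-step rule can be written out directly. The sketch below uses a hypothetical function name and assumes each PSM has at least one protein accession, with the first accession listed first.

```python
def is_decoy(proteins, decoy_prefix="DECOY_"):
    """Apply the two-step decoy rule to a non-empty list of accessions."""
    first, *alternatives = proteins
    # Step 1: the first accession decides the initial label.
    decoy = first.startswith(decoy_prefix)
    # Step 2: any alternative accession without the prefix forces a target.
    if any(not alt.startswith(decoy_prefix) for alt in alternatives):
        decoy = False
    return decoy


print(is_decoy(["DECOY_sp|P1"]))
print(is_decoy(["DECOY_sp|P1", "sp|P2"]))
print(is_decoy(["sp|P1", "DECOY_sp|P2"]))
```

Under this rule a peptide shared between a decoy and a target protein counts as a target.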
Additional Metadata
The following columns are extracted but treated as metadata (not used as rescoring features):
- ms_data_file: raw file identifier
- scan: scan number
- spectrum: spectrum ID string
- label: target/decoy label
- calc_mass: calculated neutral mass
- peptide: peptide sequence (with modifications)
- proteins: protein accessions
- charge: precursor charge (kept as metadata; one-hot columns are the features)
- retention_time: retention time in seconds
PIN Features
When parsing PIN (Percolator Input) files, the feature set is determined by the file itself. All columns that are not recognized as metadata (Label, ScanNr, SpecId, Peptide, Proteins, rank, Charge) are treated as "Original" rescoring features.
Column name matching is case-insensitive. Charge columns matching the pattern charge[_]?\d+ (e.g., charge1, charge_2) are detected automatically. For each PSM, the charge value is determined by which charge column contains a value of 1.
The Label column uses the convention 1 for target and -1 for decoy.
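The column handling described in this section might look roughly like the sketch below. It is not OptiMHC's reader: the function names are hypothetical, the example header is invented, and keeping the one-hot charge columns among the features is an assumption based on the rule that every non-metadata column is a feature.

```python
import re

# Metadata column names from the text, lowercased for case-insensitive matching.
METADATA = {"label", "scannr", "specid", "peptide", "proteins", "rank", "charge"}
CHARGE_COLUMN = re.compile(r"charge[_]?\d+", re.IGNORECASE)


def classify_pin_columns(columns):
    """Split PIN header names into (feature columns, one-hot charge columns)."""
    features = [name for name in columns if name.lower() not in METADATA]
    charge_columns = [name for name in columns if CHARGE_COLUMN.fullmatch(name)]
    return features, charge_columns


def charge_from_one_hot(row, charge_columns):
    """Return the charge whose one-hot column equals 1, or None if none does."""
    for name in charge_columns:
        if float(row[name]) == 1:
            return int(re.search(r"\d+", name).group())
    return None


def is_target(label):
    """PIN convention: 1 marks a target, -1 marks a decoy."""
    return int(label) == 1


header = ["SpecId", "Label", "ScanNr", "xcorr", "Charge2", "charge_3",
          "Peptide", "Proteins"]
features, charge_columns = classify_pin_columns(header)
row = {"Label": "-1", "Charge2": "0", "charge_3": "1"}
print(features, charge_from_one_hot(row, charge_columns), is_target(row["Label"]))
```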