Feature Generators
Base Class
base_feature_generator
PsmContainer(psms, label_column, scan_column, spectrum_column, ms_data_file_column, peptide_column, protein_column, rescoring_features, hit_rank_column=None, charge_column=None, retention_time_column=None, calculated_mass_column=None, metadata_column=None)
A container for managing peptide-spectrum matches (PSMs) in immunopeptidomics rescoring pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| psms | DataFrame | DataFrame containing the PSM data. | required |
| label_column | str | Column containing the label (True for target, False for decoy). | required |
| scan_column | str | Column containing the scan number. | required |
| spectrum_column | str | Column containing the spectrum identifier. | required |
| ms_data_file_column | str | Column containing the MS data file that the PSM originated from. | required |
| peptide_column | str | Column containing the peptide sequence. | required |
| protein_column | str | Column containing the protein accessions. | required |
| rescoring_features | dict of str to list of str | Dictionary of feature columns for rescoring. | required |
| hit_rank_column | str | Column containing the hit rank. | None |
| charge_column | str | Column containing the charge state. | None |
| retention_time_column | str | Column containing the retention time. | None |
| calculated_mass_column | str | Column containing the calculated mass. | None |
| metadata_column | str | Column containing metadata. | None |
Attributes:
| Name | Type | Description |
|---|---|---|
| psms | DataFrame | Copy of the DataFrame containing the PSM data. |
| target_psms | DataFrame | DataFrame containing only target PSMs (label = True). |
| decoy_psms | DataFrame | DataFrame containing only decoy PSMs (label = False). |
| peptides | list of str | List containing all peptides from the PSM data. |
| columns | list of str | List of column names in the PSM DataFrame. |
| rescoring_features | dict of str to list of str | Dictionary of rescoring feature columns in the PSM DataFrame. |
Source code in optimhc/psm_container.py
psms
property
Get a copy of the PSM DataFrame to prevent external modification.
Returns:
| Type | Description |
|---|---|
| DataFrame | A copy of the PSM DataFrame. |
target_psms
property
Get a DataFrame containing only target PSMs.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with only target PSMs (label = True). |
decoy_psms
property
Get a DataFrame containing only decoy PSMs.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with only decoy PSMs (label = False). |
columns
property
Get the column names of the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
| list of str | List of column names. |
feature_columns
property
Get a list of all feature columns in the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
| list of str | List of feature column names. |
feature_sources
property
Get a list of all feature sources in the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
| list of str | List of feature source names. |
peptides
property
Get the peptide sequences from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of str | List of peptide sequences. |
ms_data_files
property
Get the MS data files from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of str | List of MS data file names. |
scan_ids
property
Get the scan numbers from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of int | List of scan numbers. |
charges
property
Get the charge states from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of int | List of charge states. |
metadata
property
Get the metadata from the PSM data.
Returns:
| Type | Description |
|---|---|
| Series | Series containing metadata for each PSM. |
spectrum_ids
property
Get the spectrum identifiers from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of str | List of spectrum identifiers. |
identifier_columns
property
Get the columns that uniquely identify each PSM.
Returns:
| Type | Description |
|---|---|
| list of str | List of identifier column names. |
__len__()
copy()
Return a deep copy of the PsmContainer object.
Returns:
| Type | Description |
|---|---|
| PsmContainer | A deep copy of the current PsmContainer. |
__repr__()
Return a string representation of the PsmContainer.
Returns:
| Type | Description |
|---|---|
| str | String summary of the PsmContainer. |
Source code in optimhc/psm_container.py
drop_features(features)
Drop specified features from the PSM DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features | list of str | List of feature column names to drop. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If any of the features do not exist in the DataFrame. |
Source code in optimhc/psm_container.py
drop_source(source)
Drop all features associated with a specific source from the PSM DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | Name of the source to drop. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the source does not exist in the rescoring features. |
Source code in optimhc/psm_container.py
add_metadata(metadata_df, psms_key, metadata_key, source)
Merge new metadata into the PSM DataFrame based on specified columns. Metadata from the specified source is stored as a nested dictionary inside the metadata column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata_df | DataFrame | DataFrame containing new metadata to add. | required |
| psms_key | str or list of str | Column name(s) in the PSM data to merge on. | required |
| metadata_key | str or list of str | Column name(s) in the metadata data to merge on. | required |
| source | str | Name of the source of the new metadata. | required |
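The nested-dictionary layout described for add_metadata can be pictured with a small, dependency-free sketch. It is not the library's implementation: the helper `store_metadata`, the field names, and the values are all hypothetical, and it only illustrates one plausible shape of a single PSM's metadata cell, keyed by source.

```python
# Hypothetical helper; optimhc's add_metadata works on DataFrames, not on single cells.
def store_metadata(existing, source, new_fields):
    """Return a copy of one metadata cell with `new_fields` nested under `source`."""
    merged = dict(existing or {})      # start from the current cell (may be empty)
    merged[source] = dict(new_fields)  # one nested dict per metadata source
    return merged


# Made-up values, purely for illustration.
cell = store_metadata(None, "netmhc", {"allele": "HLA-A*02:01"})
cell = store_metadata(cell, "deeplc", {"predicted_rt": 31.7})
print(cell)
```

Keeping one nested dictionary per source means that metadata from different tools cannot collide on field names.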
Source code in optimhc/psm_container.py
get_top_hits(n=1)
Get the top n hits based on the hit rank column. If the hit rank column is not specified, returns the original PSMs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | The number of top hits to return. Default is 1. | 1 |
Returns:
| Type | Description |
|---|---|
| PsmContainer | A new PsmContainer object containing the top n hits. |
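As a rough illustration of the behaviour described for get_top_hits, the sketch below filters plain dictionaries by rank. It assumes that ranks are 1-based and that "top n" means rank <= n; both are assumptions, and `top_hits` and the row layout are hypothetical stand-ins rather than optimhc code.

```python
# Hypothetical stand-in operating on a list of dicts instead of a PsmContainer.
def top_hits(rows, n=1, rank_key=None):
    """Keep rows whose rank is at most n; without a rank column, keep everything."""
    if rank_key is None:
        # Mirrors the documented fallback: no hit rank column, original PSMs returned.
        return list(rows)
    return [row for row in rows if row[rank_key] <= n]


rows = [
    {"scan": 1, "rank": 1},
    {"scan": 1, "rank": 2},
    {"scan": 2, "rank": 1},
]
print(top_hits(rows, n=1, rank_key="rank"))
```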
Source code in optimhc/psm_container.py
add_features(features_df, psms_key, feature_key, source, suffix=None)
Merge new features into the PSM DataFrame based on specified columns.
This method performs a left join between the PSM data and the feature data, so all PSMs are preserved while new features are added. It handles column name conflicts through optional suffixing and maintains feature source tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_df | DataFrame | DataFrame containing new features to add. | required |
| psms_key | str or list of str | Column name(s) in the PSM data to merge on. | required |
| feature_key | str or list of str | Column name(s) in the features data to merge on. | required |
| source | str | Name of the source of the new features (e.g., 'deeplc', 'netmhc'). | required |
| suffix | str | Suffix to add to the new columns if there's a name conflict. Required when new feature columns have the same names as existing columns. For example, if adding features from different sources (e.g., 'score' from DeepLC and NetMHC), use suffixes like '_deeplc' or '_netmhc' to distinguish them. | None |
Returns:
| Type | Description |
|---|---|
| None | |
Raises:
| Type | Description |
|---|---|
| ValueError | If duplicate columns exist without a suffix, or if merging features changes the number of PSMs. |
Notes
The method follows these steps:
1. Validates input and prepares merge keys
2. Checks for column name conflicts
3. Manages feature source: if the source already exists, it will be overwritten
4. Performs left join merge
5. Verifies data integrity
Suffix Usage
The suffix parameter is used to handle column name conflicts:
- When adding features from different sources that might have the same column names
- When you want to keep both the original and new features with the same name
- When you need to track the source of features in the column names
If suffix is not provided and there are duplicate column names:
- The method will raise a ValueError
- You must either provide a suffix or rename the columns before adding
Examples:
>>> container = PsmContainer(...)
>>> # Adding features without suffix (no conflicts)
>>> features_df1 = pd.DataFrame({
... 'scan': [1, 2, 3],
... 'feature1': [0.1, 0.2, 0.3],
... 'feature2': [0.4, 0.5, 0.6]
... })
>>> container.add_features(
... features_df1,
... psms_key='scan',
... feature_key='scan',
... source='source1'
... )
>>> # Adding features with suffix (handling conflicts)
>>> features_df2 = pd.DataFrame({
... 'scan': [1, 2, 3],
... 'score': [0.8, 0.9, 0.7], # This would conflict with existing 'score'
... 'feature3': [0.7, 0.8, 0.9]
... })
>>> container.add_features(
... features_df2,
... psms_key='scan',
... feature_key='scan',
... source='source2',
... suffix='_new' # 'score' becomes 'score_new'
... )
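The merge steps listed in the notes for add_features (conflict check, left join, integrity check) can be sketched without pandas. This is a simplified illustration under stated assumptions, not the library's code: `left_join_features` is a hypothetical name, rows are plain dictionaries, only conflicting columns receive the suffix, and duplicate feature keys are rejected because they would change the number of PSMs.

```python
# Hypothetical, dependency-free sketch of a left join with conflict handling.
def left_join_features(psms, features, psms_key, feature_key, suffix=None):
    new_cols = [c for c in features[0] if c != feature_key] if features else []
    existing = set(psms[0]) if psms else set()

    # Check for column name conflicts before merging.
    conflicts = [c for c in new_cols if c in existing]
    if conflicts and suffix is None:
        raise ValueError(f"Duplicate columns without suffix: {conflicts}")
    renamed = {c: (c + suffix if c in conflicts else c) for c in new_cols}

    # Index the feature rows by key; duplicate keys would multiply PSM rows.
    lookup = {}
    for row in features:
        if row[feature_key] in lookup:
            raise ValueError("Duplicate feature keys would change the number of PSMs")
        lookup[row[feature_key]] = row

    # Left join: every PSM is kept, unmatched PSMs get None for the new columns.
    merged = []
    for psm in psms:
        match = lookup.get(psm[psms_key], {})
        out = dict(psm)
        for col in new_cols:
            out[renamed[col]] = match.get(col)
        merged.append(out)
    return merged


psm_rows = [{"scan": 1, "score": 10.0}, {"scan": 2, "score": 8.0}]
feature_rows = [{"scan": 1, "score": 0.8, "feature3": 0.7}]
merged_rows = left_join_features(psm_rows, feature_rows, "scan", "scan", suffix="_new")
print(merged_rows)
```

The row count of the result always equals the number of input PSMs, which is the integrity property the documented method verifies after merging.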
Source code in optimhc/psm_container.py
add_features_by_index(features_df, source, suffix=None)
Merge new features into the PSM DataFrame based on the DataFrame index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_df | DataFrame | DataFrame containing new features to add. | required |
| source | str | Name of the source of the new features. | required |
| suffix | str | Suffix to add to the new columns if there's a name conflict. | None |
Source code in optimhc/psm_container.py
add_results(results_df, psms_key, result_key)
Add the results of the rescoring engine to the PSM DataFrame based on specified columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_df | DataFrame | DataFrame containing new results to add. | required |
| psms_key | str or list of str | Column name(s) in the PSM data to merge on. | required |
| result_key | str or list of str | Column name(s) in the results data to merge on. | required |
Source code in optimhc/psm_container.py
write_pin(output_file, style='default', source=None)
Write the PSM data to a Percolator PIN file, supporting both generic Percolator and MSBooster-compatible formats. The style parameter exists so that a unified PIN file can be written for benchmarking different rescoring methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_file | str | Path to the output PIN file. | required |
| style | str | If set to 'msbooster', outputs only the columns required by MSBooster (SpecId, Label, ScanNr, retentiontime, rank, hyperscore or log10_evalue, Peptide, Proteins). If set to 'default', outputs all features specified in | 'default' |
| source | list of str | List of feature sources to include. If None, includes all sources. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | The DataFrame written to the PIN file. |
Notes
- The first three columns are always: SpecId, Label, ScanNr.
- For 'msbooster' style, the columns are: SpecId, Label, ScanNr, retentiontime, rank, hyperscore or log10_evalue, Peptide, Proteins.
- If hit_rank_column is not specified, rank is set to 1 for all rows.
- Either 'hyperscore' or 'expect' must be present in features; for 'expect', the column is written as 'log10_evalue'.
- The 'log10_evalue' column should contain the base-10 logarithm of the e-value.
- The 'Peptide' column is formatted with underscores (e.g., _.PEPTIDE._).
- For 'default' style, all features from rescoring_features are appended between the ScanNr and Peptide columns.
- The 'Proteins' column is a semicolon-separated list if stored as a list or tuple.
- The Label column is converted to 1 (target) and -1 (decoy), as required by Percolator.
Example output (default style): SpecId Label ScanNr feature1 feature2 ... Peptide Proteins
Example output (msbooster style): SpecId Label ScanNr retentiontime rank hyperscore Peptide Proteins or SpecId Label ScanNr retentiontime rank log10_evalue Peptide Proteins
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing for the selected style. |
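The row-level conventions in the notes above (label mapped to 1 or -1, flanking underscores around the peptide, semicolon-joined proteins) can be sketched as follows. `to_pin_row` and the sample values are hypothetical, and the tab separator reflects the usual PIN layout rather than anything stated on this page.

```python
# Hypothetical helper illustrating the documented 'default' style column order.
def to_pin_row(spec_id, is_target, scan, features, peptide, proteins):
    label = 1 if is_target else -1            # Percolator expects 1 / -1
    if isinstance(proteins, (list, tuple)):   # lists and tuples are joined with ';'
        proteins = ";".join(proteins)
    # Features sit between ScanNr and Peptide; the peptide gets '_.' and '._' flanks.
    return [spec_id, label, scan, *features, f"_.{peptide}._", proteins]


header = ["SpecId", "Label", "ScanNr", "feature1", "feature2", "Peptide", "Proteins"]
row = to_pin_row("run1_42", False, 42, [0.1, 0.4], "PEPTIDE", ["P1", "P2"])
print("\t".join(header))
print("\t".join(str(value) for value in row))
```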
Source code in optimhc/psm_container.py
BaseFeatureGenerator
Bases: ABC
Abstract base class for all feature generators in the rescoring pipeline.
Subclasses must implement:
- feature_columns -- names of generated feature columns
- id_column -- merge key column(s)
- generate_features() -- pure computation, returns a DataFrame
- from_config() -- construct an instance from pipeline config
The default apply() merges features by peptide column.
Override it for index-based merges, composite keys, or post-processing.
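The subclassing contract above can be sketched with a local stand-in base class, so the example runs without optimhc installed. Everything here is illustrative: `SketchBaseGenerator` only mimics the documented interface, `PeptideLengthGenerator` and its `peptide_length` feature are hypothetical, the `psms` argument is a plain dict instead of a PsmContainer, and features are returned as a list of dicts where a real generator would return a DataFrame.

```python
from abc import ABC, abstractmethod


class SketchBaseGenerator(ABC):
    """Local stand-in mirroring the documented interface; not optimhc's class."""

    @property
    @abstractmethod
    def feature_columns(self):
        """Names of the generated feature columns."""

    @property
    @abstractmethod
    def id_column(self):
        """Merge key column(s)."""

    @abstractmethod
    def generate_features(self):
        """Pure computation; returns the feature table."""

    @classmethod
    def from_config(cls, psms, config, params):
        raise NotImplementedError


class PeptideLengthGenerator(SketchBaseGenerator):
    """Hypothetical generator producing a single feature: peptide length."""

    def __init__(self, peptides):
        self.peptides = list(peptides)

    @property
    def feature_columns(self):
        return ["peptide_length"]

    @property
    def id_column(self):
        return ["Peptide"]

    def generate_features(self):
        unique = dict.fromkeys(self.peptides)  # deduplicate, keep first-seen order
        return [{"Peptide": p, "peptide_length": len(p)} for p in unique]

    @classmethod
    def from_config(cls, psms, config, params):
        # Assumption: `psms` is a dict with a 'peptides' entry in this sketch.
        return cls(psms["peptides"])


generator = PeptideLengthGenerator.from_config({"peptides": ["AAK", "AAK", "PEPTIDE"]}, {}, {})
features = generator.generate_features()
print(features)
```

Keeping generate_features() free of side effects, as the contract asks, leaves the merge strategy to apply(), which is the piece a subclass overrides for index-based or composite-key merges.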
feature_columns
abstractmethod
property
Return a list of feature column names produced by this generator.
id_column
abstractmethod
property
Return the column(s) used as merge key(s) with the PsmContainer.
generate_features()
abstractmethod
from_config(psms, config, params)
classmethod
Construct a generator instance from pipeline configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| psms | PsmContainer | The PSM container with all current data. | required |
| config | dict | The full pipeline configuration. | required |
| params | dict | Generator-specific parameters from | required |
Source code in optimhc/feature/base_feature_generator.py
apply(psms, source)
Generate features and merge them into the PsmContainer.
The default implementation merges by peptide column using
add_features(). Override for different merge strategies
(index-based, composite key) or additional post-processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| psms | PsmContainer | The PSM container to add features to (modified in-place). | required |
| source | str | Name of this feature source (e.g. | required |
Source code in optimhc/feature/base_feature_generator.py
Basic
basic
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
logger = logging.getLogger(__name__)
module-attribute
BasicFeatureGenerator(peptides, remove_pre_nxt_aa=True, remove_modification=True, *args, **kwargs)
Bases: BaseFeatureGenerator
Feature generator that generates basic features from peptide sequences.
This generator calculates features such as peptide length, proportion of unique amino acids, Shannon entropy of amino acid distribution, difference between peptide length and average peptide length, and count of unique amino acids.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| peptides | List[str] | List of peptide sequences to generate features for. | required |
| remove_pre_nxt_aa | bool | Whether to remove the amino acids adjacent to the peptide. If True, removes them. Default is True. | True |
| remove_modification | bool | Whether to remove modifications in the peptide sequences. If True, removes them. Default is True. | True |
Notes
The generated features include:
- length_diff_from_avg: Difference between peptide length and average length
- abs_length_diff_from_avg: Absolute difference between peptide length and average length
- unique_aa_count: Number of unique amino acids in the peptide
- unique_aa_proportion: Proportion of unique amino acids in the peptide
- shannon_entropy: Shannon entropy of amino acid distribution
Source code in optimhc/feature/basic.py
feature_columns
property
Return the list of generated feature column names.
id_column
property
Return the list of input columns required for feature generation.
Returns:
| Type | Description |
|---|---|
| List[str] | List of input column names required for feature generation. Currently only requires 'Peptide' column. |
generate_features()
Generate basic features for the provided peptides.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing peptides and their computed features: length_diff_from_avg (difference from average peptide length); abs_length_diff_from_avg (absolute difference from average length); unique_aa_count (number of unique amino acids); unique_aa_proportion (proportion of unique amino acids); shannon_entropy (Shannon entropy of amino acid distribution). |
Raises:
| Type | Description |
|---|---|
| ValueError | If NaN values are found in the generated features. |
Notes
All features are converted to float type before returning. The method calculates average peptide length across all peptides and uses it as a reference for length-based features.
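The five listed features can be reproduced in a few lines of standard-library Python. This is a sketch under assumptions, not the code in optimhc/feature/basic.py: `basic_features` is a hypothetical name, the entropy uses log base 2 (the library may use another base), unique_aa_proportion is taken as unique residues divided by length, and the input peptides are assumed to be already stripped of flanking residues and modifications.

```python
import math
from collections import Counter


def basic_features(peptides):
    """Compute the listed basic features for each peptide (illustrative only)."""
    avg_len = sum(len(p) for p in peptides) / len(peptides)
    rows = []
    for peptide in peptides:
        n = len(peptide)
        counts = Counter(peptide)
        # Shannon entropy of the residue distribution: -sum(p_i * log2(p_i)).
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        rows.append({
            "Peptide": peptide,
            "length_diff_from_avg": float(n - avg_len),
            "abs_length_diff_from_avg": float(abs(n - avg_len)),
            "unique_aa_count": float(len(counts)),
            "unique_aa_proportion": len(counts) / n,
            "shannon_entropy": float(entropy),
        })
    return rows


feature_rows = basic_features(["AAAA", "ACDE", "ACDEFG"])
for feature_row in feature_rows:
    print(feature_row)
```

Because the length features are relative to the average over the supplied peptides, the same peptide can receive different values in different batches.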
Source code in optimhc/feature/basic.py
Spectral Similarity
spectral_similarity
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
logger = logging.getLogger(__name__)
module-attribute
SpectralSimilarityFeatureGenerator(spectrum_ids, peptides, charges, scan_ids, mz_file_paths, model_type, collision_energies=None, instruments=None, fragmentation_types=None, remove_pre_nxt_aa=False, mod_dict=None, url='koina.wilhelmlab.org:443', ssl=True, top_n=36, tolerance_ppm=20)
Bases: BaseFeatureGenerator
Feature generator for calculating similarity between experimental and predicted spectra.
This class works through the following steps:
1. Extract experimental spectral data from mzML files
2. Use Koina for theoretical spectra prediction
3. Align experimental and predicted spectra
4. Calculate similarity metrics as features
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| peptides | list of str | List of peptide sequences. | required |
| charges | list of int | List of charge states. | required |
| scan_ids | list of int | List of scan IDs. | required |
| mz_file_paths | list of str | List of mzML file paths. | required |
| model_type | str | Prediction model type, either "HCD" or "CID". | required |
| collision_energies | list of float | List of collision energies, required when model_type is "HCD". | None |
| remove_pre_nxt_aa | bool | Whether to remove preceding and next amino acids, default is False. | False |
| remove_modification | bool | Whether to remove modifications, default is True. | required |
| url | str | Koina server URL, default is "koina.wilhelmlab.org:443". | 'koina.wilhelmlab.org:443' |
| top_n | int | Number of top peaks to use for alignment, default is 36. | 36 |
| tolerance_ppm | float | Mass tolerance for alignment in ppm, default is 20. | 20 |
Source code in optimhc/feature/spectral_similarity.py
id_column
property
Returns a list of input columns required for the feature generator.
feature_columns
property
Returns a list of feature columns generated by the feature generator.
raw_predictions
property
Returns the raw prediction results from Koina.
Returns:
| Type | Description |
|---|---|
| DataFrame | Raw prediction results DataFrame. |
input_df()
Return the generated features as a DataFrame.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the generated features. |
get_raw_predictions()
Get the raw prediction results DataFrame from Koina.
Returns:
| Type | Description |
|---|---|
| DataFrame | Raw prediction results DataFrame. |
save_raw_predictions(file_path, **kwargs)
Save the raw prediction results to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to save the file. | required |
| **kwargs | | Other parameters passed to | {} |
Source code in optimhc/feature/spectral_similarity.py
generate_features()
Public interface for generating spectral similarity features.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the generated features. |
Notes
This method is a wrapper around _generate_features that ensures the results are cached and only computed once.
Source code in optimhc/feature/spectral_similarity.py
get_full_data()
Return the full DataFrame with all columns.
Returns:
| Type | Description |
|---|---|
| DataFrame | Full DataFrame with all columns. |
Notes
This method returns the complete DataFrame including all intermediate results and raw data used in feature generation.
Source code in optimhc/feature/spectral_similarity.py
align_peaks(exp_mz_sorted, exp_intensity_sorted, pred_mz_sorted, tolerance_ppm)
Align sorted experimental peaks to sorted predicted peaks using ppm tolerance.
For each predicted peak, find the experimental peak within the tolerance window that has the highest intensity.
Returns:
| Name | Type | Description |
|---|---|---|
| aligned_exp_intensity | float64 array of length n_pred | |
| matched_exp_indices | int64 array of length n_pred (-1 = no match) | |
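The alignment rule described for align_peaks can be sketched in pure Python with a two-pointer sweep over the sorted arrays. This is not the Numba implementation in optimhc/feature/numba_utils.py: `align_peaks_sketch` is a hypothetical name, the tolerance window is assumed to be computed relative to the predicted m/z, and unmatched predicted peaks are assumed to get intensity 0.0.

```python
def align_peaks_sketch(exp_mz, exp_intensity, pred_mz, tolerance_ppm):
    """For each predicted peak, pick the most intense experimental peak within ppm tolerance."""
    aligned = [0.0] * len(pred_mz)   # assumption: unmatched peaks stay at 0.0
    matched = [-1] * len(pred_mz)    # -1 marks "no match", as in the documented return value
    start = 0
    for i, mz in enumerate(pred_mz):
        tol = mz * tolerance_ppm / 1e6
        # Both arrays are sorted ascending, so the lower window edge only moves right.
        while start < len(exp_mz) and exp_mz[start] < mz - tol:
            start += 1
        j = start
        while j < len(exp_mz) and exp_mz[j] <= mz + tol:
            if matched[i] == -1 or exp_intensity[j] > aligned[i]:
                aligned[i] = exp_intensity[j]
                matched[i] = j
            j += 1
    return aligned, matched


# Made-up peaks: two experimental candidates for the first predicted peak.
aligned_intensity, matched_indices = align_peaks_sketch(
    exp_mz=[100.0, 100.001, 200.0, 300.5],
    exp_intensity=[5.0, 9.0, 3.0, 1.0],
    pred_mz=[100.0, 200.003, 250.0],
    tolerance_ppm=20,
)
print(aligned_intensity, matched_indices)
```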
Source code in optimhc/feature/numba_utils.py
compute_similarity_features(exp_vector, pred_vector)
Compute all similarity metrics between two aligned intensity vectors.
Returns:
| Type | Description |
|---|---|
| tuple of 8 float64 values | (spectral_angle_similarity, cosine_similarity, pearson_correlation, spearman_correlation, mean_squared_error, unweighted_entropy_similarity, predicted_seen_nonzero, predicted_not_seen) |
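Two of the eight metrics are easy to illustrate on aligned intensity vectors. The sketch below assumes the commonly used normalized spectral angle, SA = 1 - 2 * arccos(cos) / pi, and returns 0.0 for the cosine when either vector is all zeros; whether the library uses exactly these definitions and conventions is not stated on this page, and both function bodies are illustrative rather than copied from optimhc.

```python
import math


def cosine_similarity(exp_vector, pred_vector):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(exp_vector, pred_vector))
    norm_exp = math.sqrt(sum(x * x for x in exp_vector))
    norm_pred = math.sqrt(sum(y * y for y in pred_vector))
    if norm_exp == 0.0 or norm_pred == 0.0:
        return 0.0  # assumption: an empty spectrum has no similarity
    return dot / (norm_exp * norm_pred)


def spectral_angle_similarity(exp_vector, pred_vector):
    """Normalized spectral angle: 1 for identical direction, 0 for orthogonal vectors."""
    cos = max(-1.0, min(1.0, cosine_similarity(exp_vector, pred_vector)))  # clamp rounding error
    return 1.0 - 2.0 * math.acos(cos) / math.pi


same = spectral_angle_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = spectral_angle_similarity([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

The spectral angle spreads out values near cosine 1 more than the cosine itself does, which is one reason it is often reported alongside it when comparing predicted and experimental spectra.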
Source code in optimhc/feature/numba_utils.py
extract_mzml_data(mzml_filename, scan_ids=None)
Extract scan data from an mzML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mzml_filename | str | The path to the mzML file. | required |
| scan_ids | list[int] or None | A list of scan IDs to extract. If None, extracts all scans. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing the extracted scan data with columns: source (the source file name); scan (the scan ID); mz (the m/z values array); intensity (the intensity values array); charge (the charge state); retention_time (the retention time). |
Notes
This function:
1. Reads the mzML file using pyteomics
2. Extracts scan data including retention time, charge state, m/z values, and intensities
3. Filters scans based on provided scan IDs if specified
4. Returns a DataFrame with the extracted data
Source code in optimhc/parser/mzml.py
DeepLC
deeplc
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
logger = logging.getLogger(__name__)
module-attribute
BaseFeatureGenerator
¶
Bases: ABC
Abstract base class for all feature generators in the rescoring pipeline.
Subclasses must implement:
- feature_columns -- names of generated feature columns
- id_column -- merge key column(s)
- generate_features() -- pure computation, returns a DataFrame
- from_config() -- construct an instance from pipeline config
The default apply() merges features by peptide column.
Override it for index-based merges, composite keys, or post-processing.
feature_columns
abstractmethod
property
¶
Return a list of feature column names produced by this generator.
id_column
abstractmethod
property
¶
Return the column(s) used as merge key(s) with the PsmContainer.
generate_features()
abstractmethod
¶
from_config(psms, config, params)
classmethod
¶
Construct a generator instance from pipeline configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
psms
|
PsmContainer
|
The PSM container with all current data. |
required |
config
|
dict
|
The full pipeline configuration. |
required |
params
|
dict
|
Generator-specific parameters from
|
required |
Source code in optimhc/feature/base_feature_generator.py
apply(psms, source)
¶
Generate features and merge them into the PsmContainer.
The default implementation merges by peptide column using
add_features(). Override for different merge strategies
(index-based, composite key) or additional post-processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
psms
|
PsmContainer
|
The PSM container to add features to (modified in-place). |
required |
source
|
str
|
Name of this feature source (e.g. |
required |
Source code in optimhc/feature/base_feature_generator.py
PsmContainer(psms, label_column, scan_column, spectrum_column, ms_data_file_column, peptide_column, protein_column, rescoring_features, hit_rank_column=None, charge_column=None, retention_time_column=None, calculated_mass_column=None, metadata_column=None)
¶
A container for managing peptide-spectrum matches (PSMs) in immunopeptidomics rescoring pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
psms
|
DataFrame
|
DataFrame containing the PSM data. |
required |
label_column
|
str
|
Column containing the label (True for target, False for decoy). |
required |
scan_column
|
str
|
Column containing the scan number. |
required |
spectrum_column
|
str
|
Column containing the spectrum identifier. |
required |
ms_data_file_column
|
str
|
Column containing the MS data file that the PSM originated from. |
required |
peptide_column
|
str
|
Column containing the peptide sequence. |
required |
protein_column
|
str
|
Column containing the protein accessions. |
required |
rescoring_features
|
dict of str to list of str
|
Dictionary of feature columns for rescoring. |
required |
hit_rank_column
|
str
|
Column containing the hit rank. |
None
|
charge_column
|
str
|
Column containing the charge state. |
None
|
retention_time_column
|
str
|
Column containing the retention time. |
None
|
calculated_mass_column
|
str
|
Column containing the calculated mass. |
None
|
metadata_column
|
str
|
Column containing metadata. |
None
|
Attributes:
| Name | Type | Description |
|---|---|---|
psms |
DataFrame
|
Copy of the DataFrame containing the PSM data. |
target_psms |
DataFrame
|
DataFrame containing only target PSMs (label = True). |
decoy_psms |
DataFrame
|
DataFrame containing only decoy PSMs (label = False). |
peptides |
list of str
|
List containing all peptides from the PSM data. |
columns |
list of str
|
List of column names in the PSM DataFrame. |
rescoring_features |
dict of str to list of str
|
Dictionary of rescoring feature columns in the PSM DataFrame. |
Source code in optimhc/psm_container.py
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 | |
psms
property
¶
Get a copy of the PSM DataFrame to prevent external modification.
Returns:
| Type | Description |
|---|---|
DataFrame
|
A copy of the PSM DataFrame. |
target_psms
property
¶
Get a DataFrame containing only target PSMs.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with only target PSMs (label = True). |
decoy_psms
property
¶
Get a DataFrame containing only decoy PSMs.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with only decoy PSMs (label = False). |
columns
property
¶
Get the column names of the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
| list of str | List of column names. |
feature_columns
property
¶
Get a list of all feature columns in the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
| list of str | List of feature column names. |
feature_sources
property
¶
Get a list of all feature sources in the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
| list of str | List of feature source names. |
peptides
property
¶
Get the peptide sequences from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of str | List of peptide sequences. |
ms_data_files
property
¶
Get the MS data files from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of str | List of MS data file names. |
scan_ids
property
¶
Get the scan numbers from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of int | List of scan numbers. |
charges
property
¶
Get the charge states from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of int | List of charge states. |
metadata
property
¶
Get the metadata from the PSM data.
Returns:
| Type | Description |
|---|---|
| Series | Series containing metadata for each PSM. |
spectrum_ids
property
¶
Get the spectrum identifiers from the PSM data.
Returns:
| Type | Description |
|---|---|
| list of str | List of spectrum identifiers. |
identifier_columns
property
¶
Get the columns that uniquely identify each PSM.
Returns:
| Type | Description |
|---|---|
| list of str | List of identifier column names. |
__len__()
¶
copy()
¶
Return a deep copy of the PsmContainer object.
Returns:
| Type | Description |
|---|---|
| PsmContainer | A deep copy of the current PsmContainer. |
__repr__()
¶
Return a string representation of the PsmContainer.
Returns:
| Type | Description |
|---|---|
| str | String summary of the PsmContainer. |
Source code in optimhc/psm_container.py
drop_features(features)
¶
Drop specified features from the PSM DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features | list of str | List of feature column names to drop. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If any of the features do not exist in the DataFrame. |
Source code in optimhc/psm_container.py
drop_source(source)
¶
Drop all features associated with a specific source from the PSM DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | Name of the source to drop. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the source does not exist in the rescoring features. |
Source code in optimhc/psm_container.py
add_metadata(metadata_df, psms_key, metadata_key, source)
¶
Merge new metadata into the PSM DataFrame based on specified columns. Metadata from the specified source is stored as a nested dictionary inside the metadata column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata_df | DataFrame | DataFrame containing new metadata to add. | required |
| psms_key | str or list of str | Column name(s) in the PSM data to merge on. | required |
| metadata_key | str or list of str | Column name(s) in the metadata DataFrame to merge on. | required |
| source | str | Name of the source of the new metadata. | required |
Source code in optimhc/psm_container.py
get_top_hits(n=1)
¶
Get the top n hits based on the hit rank column. If the hit rank column is not specified, returns the original PSMs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | The number of top hits to return. Default is 1. | 1 |
Returns:
| Type | Description |
|---|---|
| PsmContainer | A new PsmContainer object containing the top n hits. |
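The selection rule described above can be sketched on a plain DataFrame. This is a minimal illustration and not the library's implementation: the helper name `top_hits` and the example column names are hypothetical, the sketch returns a DataFrame rather than a new PsmContainer, and it assumes that rank 1 is the best hit, so "top n" means rank <= n.

```python
from typing import Optional

import pandas as pd


def top_hits(psms: pd.DataFrame, hit_rank_column: Optional[str], n: int = 1) -> pd.DataFrame:
    """Return PSMs ranked within the top n, or all PSMs if no rank column is set."""
    if hit_rank_column is None:
        # No rank information: fall back to the original PSMs.
        return psms.copy()
    # Assumption: rank 1 is the best hit, so "top n" means rank <= n.
    return psms[psms[hit_rank_column] <= n].copy()


# Hypothetical example data: two candidate peptides per scan.
psms = pd.DataFrame(
    {
        "scan": [1, 1, 2, 2],
        "peptide": ["AAAK", "AAAR", "CCCK", "CCCR"],
        "rank": [1, 2, 1, 2],
    }
)
top1 = top_hits(psms, "rank", n=1)
```

With `n=1` only the rank-1 row of each scan is kept; passing `None` as the rank column returns every row unchanged.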
Source code in optimhc/psm_container.py
add_features(features_df, psms_key, feature_key, source, suffix=None)
¶
Merge new features into the PSM DataFrame based on specified columns.
This method performs a left join between the PSM data and feature data, ensuring that all PSMs are preserved while adding new features. It handles column name conflicts through optional suffixing and maintains feature source tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_df | DataFrame | DataFrame containing new features to add. | required |
| psms_key | str or list of str | Column name(s) in the PSM data to merge on. | required |
| feature_key | str or list of str | Column name(s) in the features data to merge on. | required |
| source | str | Name of the source of the new features (e.g., 'deeplc', 'netmhc'). | required |
| suffix | str | Suffix to add to the new columns if there's a name conflict. Required when new feature columns have the same names as existing columns. For example, if adding features from different sources (e.g., 'score' from DeepLC and NetMHC), use suffixes like '_deeplc' or '_netmhc' to distinguish them. | None |
Returns:
| Type | Description |
|---|---|
| None | |
Raises:
| Type | Description |
|---|---|
| ValueError | If duplicate column names exist and no suffix is given, or if merging the features changes the number of PSMs. |
Notes
The method follows these steps:
1. Validates input and prepares merge keys
2. Checks for column name conflicts
3. Manages feature source: if the source already exists, it will be overwritten
4. Performs left join merge
5. Verifies data integrity
Suffix Usage
The suffix parameter is used to handle column name conflicts:
- When adding features from different sources that might have the same column names
- When you want to keep both the original and new features with the same name
- When you need to track the source of features in the column names
If suffix is not provided and there are duplicate column names:
- The method will raise a ValueError
- You must either provide a suffix or rename the columns before adding
Examples:
>>> import pandas as pd
>>> container = PsmContainer(...)
>>> # Adding features without suffix (no conflicts)
>>> features_df1 = pd.DataFrame({
...     'scan': [1, 2, 3],
...     'feature1': [0.1, 0.2, 0.3],
...     'feature2': [0.4, 0.5, 0.6]
... })
>>> container.add_features(
...     features_df1,
...     psms_key='scan',
...     feature_key='scan',
...     source='source1'
... )
>>> # Adding features with suffix (handling conflicts)
>>> features_df2 = pd.DataFrame({
...     'scan': [1, 2, 3],
...     'score': [0.8, 0.9, 0.7],  # This would conflict with existing 'score'
...     'feature3': [0.7, 0.8, 0.9]
... })
>>> container.add_features(
...     features_df2,
...     psms_key='scan',
...     feature_key='scan',
...     source='source2',
...     suffix='_new'  # 'score' becomes 'score_new'
... )
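The merge behaviour described in the Notes (conflict check, left join, row-count check) can be approximated with plain pandas. The following is a sketch under stated assumptions, not the code in optimhc/psm_container.py: `merge_features` is a hypothetical helper, it takes single-column keys only, it renames only the clashing columns (whether the library suffixes every new column is not stated here), and it leaves out source tracking.

```python
from typing import Optional

import pandas as pd


def merge_features(
    psms: pd.DataFrame,
    features: pd.DataFrame,
    psms_key: str,
    feature_key: str,
    suffix: Optional[str] = None,
) -> pd.DataFrame:
    """Left-join features onto PSMs, refusing silent column clashes."""
    new_cols = [c for c in features.columns if c != feature_key]
    clashes = [c for c in new_cols if c in psms.columns]
    if clashes:
        if suffix is None:
            raise ValueError(f"Duplicate columns without suffix: {clashes}")
        # Assumption: only the clashing columns receive the suffix.
        features = features.rename(columns={c: c + suffix for c in clashes})
    # The left join keeps every PSM, even those without a matching feature row.
    merged = psms.merge(features, how="left", left_on=psms_key, right_on=feature_key)
    if len(merged) != len(psms):
        # Duplicate keys in the feature table would multiply PSM rows.
        raise ValueError("Merging features changed the number of PSMs")
    return merged


psms = pd.DataFrame({"scan": [1, 2, 3], "score": [10.0, 20.0, 30.0]})
scores = pd.DataFrame({"scan": [1, 2], "score": [0.8, 0.9]})
merged = merge_features(psms, scores, "scan", "scan", suffix="_new")
```

Scan 3 has no feature row, so its `score_new` value is missing after the join; the same call without `suffix` raises a ValueError because `score` already exists.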
Source code in optimhc/psm_container.py
add_features_by_index(features_df, source, suffix=None)
¶
Merge new features into the PSM DataFrame based on the DataFrame index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_df | DataFrame | DataFrame containing new features to add. | required |
| source | str | Name of the source of the new features. | required |
| suffix | str | Suffix to add to the new columns if there's a name conflict. | None |
Source code in optimhc/psm_container.py
add_results(results_df, psms_key, result_key)
¶
Add the results of the rescoring engine to the PSM DataFrame based on specified columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_df | DataFrame | DataFrame containing new results to add. | required |
| psms_key | str or list of str | Column name(s) in the PSM data to merge on. | required |
| result_key | str or list of str | Column name(s) in the results data to merge on. | required |
Source code in optimhc/psm_container.py
write_pin(output_file, style='default', source=None)
¶
Write the PSM data to a Percolator PIN file, supporting both generic Percolator and MSBooster-compatible formats. The style parameter exists mainly to produce a unified PIN file for benchmarking different rescoring methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_file | str | Path to the output PIN file. | required |
| style | str | If set to 'msbooster', outputs only the columns required by MSBooster (SpecId, Label, ScanNr, retentiontime, rank, hyperscore or log10_evalue, Peptide, Proteins). If set to 'default', outputs all features specified in | 'default' |
| source | list of str | List of feature sources to include. If None, includes all sources. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | The DataFrame written to the PIN file. |
Notes
- The first three columns are always: SpecId, Label, ScanNr.
- For 'msbooster' style, the columns are: SpecId, Label, ScanNr, retentiontime, rank, hyperscore or log10_evalue, Peptide, Proteins.
- If hit_rank_column is not specified, rank is set to 1 for all rows.
- Either 'hyperscore' or 'expect' must be present in features; for 'expect', the column is written as 'log10_evalue'.
- The 'log10_evalue' column should contain the base-10 logarithm of the e-value.
- The 'Peptide' column is formatted with underscores (e.g., _.PEPTIDE._).
- For the default style, all features from rescoring_features are appended between the ScanNr and Peptide columns.
- The 'Proteins' column is a semicolon-separated list if stored as a list or tuple.
- Label column is converted to 1 (target) and -1 (decoy), as required by Percolator.
Example output (default style):
SpecId Label ScanNr feature1 feature2 ... Peptide Proteins
Example output (msbooster style):
SpecId Label ScanNr retentiontime rank hyperscore Peptide Proteins
or
SpecId Label ScanNr retentiontime rank log10_evalue Peptide Proteins
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing for the selected style. |
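The per-row conversions listed in the Notes (label to 1/-1, underscore-flanked peptide, semicolon-joined proteins, log10 of the e-value) can be sketched in a few lines of plain Python. The helper `to_pin_fields` is hypothetical and only illustrates those rules; it assumes the e-value conversion is a plain base-10 logarithm and does not reproduce column ordering or file writing.

```python
import math
from typing import Sequence, Tuple, Union


def to_pin_fields(
    label: bool,
    peptide: str,
    proteins: Union[str, Sequence[str]],
    expect: float,
) -> Tuple[int, str, str, float]:
    """Convert one PSM's values into PIN-style fields."""
    # Percolator expects 1 for targets and -1 for decoys.
    pin_label = 1 if label else -1
    # Flank the sequence with underscores, e.g. _.PEPTIDE._
    pin_peptide = f"_.{peptide}._"
    # Lists or tuples of accessions become one semicolon-separated string.
    if isinstance(proteins, (list, tuple)):
        pin_proteins = ";".join(proteins)
    else:
        pin_proteins = str(proteins)
    # Assumption: 'expect' is a positive e-value, written as its log10.
    log10_evalue = math.log10(expect)
    return pin_label, pin_peptide, pin_proteins, log10_evalue


target_row = to_pin_fields(True, "PEPTIDE", ["P1", "P2"], 0.01)
decoy_row = to_pin_fields(False, "EDITPEP", "rev_P1", 1.0)
```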
Source code in optimhc/psm_container.py
DeepLCFeatureGenerator(psms, calibration_criteria_column, lower_score_is_better=False, calibration_set_size=None, processes=1, model_path=None, remove_pre_nxt_aa=True, mod_dict=None, *args, **kwargs)
¶
Bases: BaseFeatureGenerator
Generate DeepLC-based features for rescoring.
This generator uses DeepLC to predict retention times and calculates various features based on the differences between predicted and observed retention times.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| psms | PsmContainer | PSMs to generate features for. | required |
| calibration_criteria_column | str | Column name in the PSMs DataFrame to use for DeepLC calibration. | required |
| lower_score_is_better | bool | Whether a lower PSM score denotes a better matching PSM. Default is False. | False |
| calibration_set_size | int or float | Amount of best PSMs to use for DeepLC calibration. If this value exceeds the number of available PSMs, all PSMs are used. The signature default is None; the docstring states a default of 0.15. | None |
| processes | int | Number of processes to use in DeepLC. Default is 1. | 1 |
| model_path | str | Path to the DeepLC model. If None, the default model will be used. | None |
| remove_pre_nxt_aa | bool | Whether to remove the first and last amino acids from the peptide sequence. Default is True. | True |
| mod_dict | dict | Dictionary of modifications to be used for DeepLC. If None, no modifications will be used. | None |
Notes
DeepLC retraining is on by default. Pass deeplc_retrain=False as a keyword argument to disable retraining.
The generated features include:
- observed_retention_time: Original retention time from the data
- predicted_retention_time: DeepLC predicted retention time
- retention_time_diff: Difference between predicted and observed times
- abs_retention_time_diff: Absolute difference between predicted and observed times
- retention_time_ratio: Ratio of min(pred,obs) to max(pred,obs)
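The derived retention-time features listed above follow from two input columns. Here is a minimal pandas sketch, with a hypothetical helper name and the assumption that the difference is taken as predicted minus observed (the sign convention is not stated here); predictions come from DeepLC in the real generator, whereas this sketch takes them as given.

```python
from typing import Sequence

import pandas as pd


def retention_time_features(observed: Sequence[float], predicted: Sequence[float]) -> pd.DataFrame:
    """Derive the documented retention-time features from two columns."""
    df = pd.DataFrame(
        {
            "observed_retention_time": observed,
            "predicted_retention_time": predicted,
        }
    )
    # Assumption: difference is predicted minus observed.
    df["retention_time_diff"] = df["predicted_retention_time"] - df["observed_retention_time"]
    df["abs_retention_time_diff"] = df["retention_time_diff"].abs()
    # Ratio of the smaller to the larger value, so it never exceeds 1
    # when both times are positive.
    pair = df[["observed_retention_time", "predicted_retention_time"]]
    df["retention_time_ratio"] = pair.min(axis=1) / pair.max(axis=1)
    return df


rt_features = retention_time_features(observed=[10.0, 20.0], predicted=[12.0, 15.0])
```

Because the ratio divides the smaller value by the larger one, it is symmetric in prediction and observation, unlike the signed difference.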
Generate DeepLC-based features for rescoring.
DeepLC retraining is on by default. Pass deeplc_retrain=False as a keyword argument to disable retraining.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| psms | PsmContainer | PSMs to generate features for. | required |
| calibration_criteria_column | str | Column name in the PSMs DataFrame to use for DeepLC calibration. | required |
| lower_score_is_better | bool | Whether a lower PSM score denotes a better matching PSM. Default: False. | False |
| calibration_set_size | int or float | Amount of best PSMs to use for DeepLC calibration. If this value exceeds the number of available PSMs, all PSMs are used. The signature default is None; the docstring states a default of 0.15. | None |
| processes | int or None | Number of processes to use in DeepLC. Defaults to 1. | 1 |
| model_path | str | Path to the DeepLC model. If None, the default model will be used. | None |
| remove_pre_nxt_aa | bool | Whether to remove the first and last amino acids from the peptide sequence. Default: True. | True |
| mod_dict | dict | Dictionary of modifications to be used for DeepLC. If None, no modifications will be used. | None |
| *args | list | Additional positional arguments are passed to DeepLC. | () |
| **kwargs | dict | Additional keyword arguments are passed to DeepLC. | {} |
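How `calibration_set_size` and `lower_score_is_better` might interact can be sketched as follows. This is an illustration under assumptions, not the generator's code: the helper `select_calibration_set` is hypothetical, and it assumes that a float is read as a fraction of the available PSMs and an int as an absolute count, which the parameter table does not state explicitly.

```python
from typing import List, Tuple, Union


def select_calibration_set(
    scored: List[Tuple[str, float]],
    size: Union[int, float],
    lower_score_is_better: bool = False,
) -> List[Tuple[str, float]]:
    """Pick the best-scoring (peptide, score) pairs for calibration."""
    if isinstance(size, float):
        # Assumption: a float is a fraction of the available PSMs.
        if not 0 < size <= 1:
            raise ValueError("A fractional size must be in (0, 1]")
        n = int(len(scored) * size)
    else:
        # Assumption: an int is an absolute number of PSMs.
        n = size
    # A request larger than the data simply uses all PSMs.
    n = min(n, len(scored))
    # Best first: descending scores unless lower scores are better.
    ranked = sorted(scored, key=lambda item: item[1], reverse=not lower_score_is_better)
    return ranked[:n]


scored = [("A", 0.9), ("B", 0.1), ("C", 0.5), ("D", 0.7)]
best_half = select_calibration_set(scored, 0.5)
everything = select_calibration_set(scored, 10)
lowest_two = select_calibration_set(scored, 2, lower_score_is_better=True)
```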
Source code in optimhc/feature/deeplc.py
feature_columns
property
¶
Return the list of generated feature column names.
Returns:
| Type | Description |
|---|---|
| List[str] | List of feature column names: observed_retention_time, predicted_retention_time, retention_time_diff, abs_retention_time_diff, retention_time_ratio. |
id_column
property
¶
Return the list of input columns required for the feature generator.
Returns:
| Type | Description |
|---|---|
| List[str] | List of input columns required for feature generation. Currently returns an empty list as the required columns are handled internally by the PsmContainer. |
raw_predictions
property
¶
Get the raw predictions DataFrame.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the raw predictions: peptide (cleaned peptide sequence); predicted_rt (DeepLC predicted retention time); observed_rt (original retention time); modifications (Unimod format modifications). |
Notes
If predictions haven't been generated yet, this will trigger feature generation automatically.
generate_features()
¶
Generate DeepLC features for the provided PSMs.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the PSMs with added DeepLC features: original_seq (original peptide sequence); observed_retention_time (original retention time); predicted_retention_time (DeepLC predicted retention time); retention_time_diff (difference between predicted and observed times); abs_retention_time_diff (absolute difference between predicted and observed times); retention_time_ratio (ratio of min(pred,obs) to max(pred,obs)). |
Notes
This method:
1. Prepares data in DeepLC format
2. Calibrates DeepLC if calibration set is specified
3. Predicts retention times
4. Calculates various retention time-based features
5. Handles missing values by imputing with median values
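The last step, median imputation, is a standard pandas idiom. The sketch below assumes a per-column median and uses a hypothetical helper name; which columns the generator imputes, and on which subset the median is computed, is not specified here.

```python
from typing import List

import pandas as pd


def impute_with_median(features: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Fill missing values in the given columns with each column's median."""
    out = features.copy()
    for col in columns:
        # Series.median() skips missing values by default.
        out[col] = out[col].fillna(out[col].median())
    return out


raw = pd.DataFrame({"retention_time_diff": [1.0, None, 3.0]})
imputed = impute_with_median(raw, ["retention_time_diff"])
```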
Source code in optimhc/feature/deeplc.py
get_full_data()
¶
Get the full DeepLC DataFrame.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the DeepLC input data with all columns: original_seq (original peptide sequence); label (target/decoy label); seq (cleaned peptide sequence); modifications (Unimod format modifications); tr (retention time); score (calibration criteria score); predicted_retention_time (DeepLC predicted retention time); retention_time_diff (difference between predicted and observed times). |
Source code in optimhc/feature/deeplc.py
get_raw_predictions()
¶
Get the raw predictions DataFrame.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the raw predictions: peptide (cleaned peptide sequence); predicted_rt (DeepLC predicted retention time); observed_rt (original retention time); modifications (Unimod format modifications). |
Notes
This is a convenience method that returns the same data as the raw_predictions property.
Source code in optimhc/feature/deeplc.py
save_raw_predictions(file_path, **kwargs)
¶
Save the raw prediction results to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to save the file. | required |
| **kwargs | dict | Additional parameters passed to pandas.DataFrame.to_csv. If 'index' is not specified, it defaults to False. | {} |
Notes
This method saves the raw predictions DataFrame to a CSV file. The DataFrame includes:
- peptide: Cleaned peptide sequence
- predicted_rt: DeepLC predicted retention time
- observed_rt: Original retention time
- modifications: Unimod format modifications
Source code in optimhc/feature/deeplc.py
Overlapping Peptide¶
overlapping_peptide
¶
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
¶
logger = logging.getLogger(__name__)
module-attribute
¶
BaseFeatureGenerator
¶
Bases: ABC
Abstract base class for all feature generators in the rescoring pipeline.
Subclasses must implement:
- feature_columns -- names of generated feature columns
- id_column -- merge key column(s)
- generate_features() -- pure computation, returns a DataFrame
- from_config() -- construct an instance from pipeline config
The default apply() merges features by peptide column.
Override it for index-based merges, composite keys, or post-processing.
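The contract above can be illustrated with a small stand-in hierarchy. Everything in this sketch is hypothetical: `SketchBaseGenerator` only mirrors the four documented members and is not the optimhc class, `PeptideLengthGenerator` is an invented example, `generate_features()` returns a list of dicts instead of a DataFrame to keep the sketch dependency-free, and `from_config` assumes the container exposes the documented `peptides` attribute.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace
from typing import Dict, List


class SketchBaseGenerator(ABC):
    """Hypothetical stand-in mirroring the documented abstract interface."""

    @property
    @abstractmethod
    def feature_columns(self) -> List[str]:
        """Names of the generated feature columns."""

    @property
    @abstractmethod
    def id_column(self) -> List[str]:
        """Merge key column(s)."""

    @abstractmethod
    def generate_features(self) -> List[Dict[str, object]]:
        """Pure computation; the real class returns a DataFrame."""

    @classmethod
    @abstractmethod
    def from_config(cls, psms, config, params):
        """Build an instance from pipeline configuration."""


class PeptideLengthGenerator(SketchBaseGenerator):
    """Invented example generator: one feature, the peptide length."""

    def __init__(self, peptides):
        self.peptides = list(peptides)

    @property
    def feature_columns(self) -> List[str]:
        return ["peptide_length"]

    @property
    def id_column(self) -> List[str]:
        return ["Peptide"]

    def generate_features(self) -> List[Dict[str, object]]:
        # One row per unique peptide, in first-seen order.
        unique = dict.fromkeys(self.peptides)
        return [{"Peptide": p, "peptide_length": len(p)} for p in unique]

    @classmethod
    def from_config(cls, psms, config, params):
        # Assumption: the container exposes a 'peptides' list.
        return cls(psms.peptides)


fake_psms = SimpleNamespace(peptides=["AAAK", "CCCKR", "AAAK"])
generator = PeptideLengthGenerator.from_config(fake_psms, config={}, params={})
rows = generator.generate_features()
```

Returning one row per unique peptide fits a merge keyed on the peptide column, which is what the default apply() is described as doing.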
feature_columns
abstractmethod
property
¶
Return a list of feature column names produced by this generator.
id_column
abstractmethod
property
¶
Return the column(s) used as merge key(s) with the PsmContainer.
generate_features()
abstractmethod
¶
from_config(psms, config, params)
classmethod
¶
Construct a generator instance from pipeline configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| psms | PsmContainer | The PSM container with all current data. | required |
| config | dict | The full pipeline configuration. | required |
| params | dict | Generator-specific parameters from | required |
Source code in optimhc/feature/base_feature_generator.py
apply(psms, source)
¶
Generate features and merge them into the PsmContainer.
The default implementation merges by peptide column using
add_features(). Override for different merge strategies
(index-based, composite key) or additional post-processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| psms | PsmContainer | The PSM container to add features to (modified in-place). | required |
| source | str | Name of this feature source (e.g. | required |
Source code in optimhc/feature/base_feature_generator.py
PsmContainer(psms, label_column, scan_column, spectrum_column, ms_data_file_column, peptide_column, protein_column, rescoring_features, hit_rank_column=None, charge_column=None, retention_time_column=None, calculated_mass_column=None, metadata_column=None)
¶
A container for managing peptide-spectrum matches (PSMs) in immunopeptidomics rescoring pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
psms
|
DataFrame
|
DataFrame containing the PSM data. |
required |
label_column
|
str
|
Column containing the label (True for target, False for decoy). |
required |
scan_column
|
str
|
Column containing the scan number. |
required |
spectrum_column
|
str
|
Column containing the spectrum identifier. |
required |
ms_data_file_column
|
str
|
Column containing the MS data file that the PSM originated from. |
required |
peptide_column
|
str
|
Column containing the peptide sequence. |
required |
protein_column
|
str
|
Column containing the protein accessions. |
required |
rescoring_features
|
dict of str to list of str
|
Dictionary of feature columns for rescoring. |
required |
hit_rank_column
|
str
|
Column containing the hit rank. |
None
|
charge_column
|
str
|
Column containing the charge state. |
None
|
retention_time_column
|
str
|
Column containing the retention time. |
None
|
calculated_mass_column
|
str
|
Column containing the calculated mass. |
None
|
metadata_column
|
str
|
Column containing metadata. |
None
|
Attributes:
| Name | Type | Description |
|---|---|---|
psms |
DataFrame
|
Copy of the DataFrame containing the PSM data. |
target_psms |
DataFrame
|
DataFrame containing only target PSMs (label = True). |
decoy_psms |
DataFrame
|
DataFrame containing only decoy PSMs (label = False). |
peptides |
list of str
|
List containing all peptides from the PSM data. |
columns |
list of str
|
List of column names in the PSM DataFrame. |
rescoring_features |
dict of str to list of str
|
Dictionary of rescoring feature columns in the PSM DataFrame. |
Source code in optimhc/psm_container.py
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 | |
psms
property
¶
Get a copy of the PSM DataFrame to prevent external modification.
Returns:
| Type | Description |
|---|---|
DataFrame
|
A copy of the PSM DataFrame. |
target_psms
property
¶
Get a DataFrame containing only target PSMs.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame with only target PSMs (label = True). |
decoy_psms
property
¶
Get a DataFrame containing only decoy PSMs.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame with only decoy PSMs (label = False). |
columns
property
¶
Get the column names of the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
list of str
|
List of column names. |
feature_columns
property
¶
Get a list of all feature columns in the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
list of str
|
List of feature column names. |
feature_sources
property
¶
Get a list of all feature sources in the PSM DataFrame.
Returns:
| Type | Description |
|---|---|
list of str
|
List of feature source names. |
peptides
property
¶
Get the peptide sequences from the PSM data.
Returns:
| Type | Description |
|---|---|
list of str
|
List of peptide sequences. |
ms_data_files
property
¶
Get the MS data files from the PSM data.
Returns:
| Type | Description |
|---|---|
list of str
|
List of MS data file names. |
scan_ids
property
¶
Get the scan numbers from the PSM data.
Returns:
| Type | Description |
|---|---|
list of int
|
List of scan numbers. |
charges
property
¶
Get the charge states from the PSM data.
Returns:
| Type | Description |
|---|---|
list of int
|
List of charge states. |
metadata
property
¶
Get the metadata from the PSM data.
Returns:
| Type | Description |
|---|---|
Series
|
Series containing metadata for each PSM. |
spectrum_ids
property
¶
Get the spectrum identifiers from the PSM data.
Returns:
| Type | Description |
|---|---|
list of str
|
List of spectrum identifiers. |
identifier_columns
property
¶
Get the columns that uniquely identify each PSM.
Returns:
| Type | Description |
|---|---|
list of str
|
List of identifier column names. |
__len__()
¶
copy()
¶
Return a deep copy of the PsmContainer object.
Returns:
| Type | Description |
|---|---|
PsmContainer
|
A deep copy of the current PsmContainer. |
__repr__()
¶
Return a string representation of the PsmContainer.
Returns:
| Type | Description |
|---|---|
str
|
String summary of the PsmContainer. |
Source code in optimhc/psm_container.py
drop_features(features)
¶
Drop specified features from the PSM DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
features
|
list of str
|
List of feature column names to drop. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If any of the features do not exist in the DataFrame. |
Source code in optimhc/psm_container.py
drop_source(source)
¶
Drop all features associated with a specific source from the PSM DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source
|
str
|
Name of the source to drop. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the source does not exist in the rescoring features. |
Source code in optimhc/psm_container.py
add_metadata(metadata_df, psms_key, metadata_key, source)
¶
Merge new metadata into the PSM DataFrame based on specified columns. Metadata from the specified source is stored as a nested dictionary inside the metadata column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metadata_df
|
DataFrame
|
DataFrame containing new metadata to add. |
required |
psms_key
|
str or list of str
|
Column name(s) in the PSM data to merge on. |
required |
metadata_key
|
str or list of str
|
Column name(s) in the metadata data to merge on. |
required |
source
|
str
|
Name of the source of the new metadata. |
required |
Source code in optimhc/psm_container.py
get_top_hits(n=1)
¶
Get the top n hits based on the hit rank column. If the hit rank column is not specified, returns the original PSMs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n
|
int
|
The number of top hits to return. Default is 1. |
1
|
Returns:
| Type | Description |
|---|---|
PsmContainer
|
A new PsmContainer object containing the top n hits. |
Source code in optimhc/psm_container.py
add_features(features_df, psms_key, feature_key, source, suffix=None)
¶
Merge new features into the PSM DataFrame based on specified columns.
This method performs a left join between the PSM data and feature data, ensuring that all PSMs are preserved while adding new features. It handles column name conflicts through optional suffixing and maintains feature source tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
features_df
|
DataFrame
|
DataFrame containing new features to add. |
required |
psms_key
|
str or list of str
|
Column name(s) in the PSM data to merge on. |
required |
feature_key
|
str or list of str
|
Column name(s) in the features data to merge on. |
required |
source
|
str
|
Name of the source of the new features (e.g., 'deeplc', 'netmhc'). |
required |
suffix
|
str
|
Suffix to add to the new columns if there's a name conflict. Required when new feature columns have the same names as existing columns. For example, if adding features from different sources (e.g., 'score' from DeepLC and NetMHC), use suffixes like '_deeplc' or '_netmhc' to distinguish them. |
None
|
Returns:
| Type | Description |
|---|---|
None
|
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If duplicate columns exist without suffix. If merging features changes the number of PSMs. |
Notes
The method follows these steps: 1. Validates input and prepares merge keys 2. Checks for column name conflicts 3. Manages feature source: if the source already exists, it will be overwritten 4. Performs left join merge 5. Verifies data integrity
Suffix Usage
The suffix parameter is used to handle column name conflicts: - When adding features from different sources that might have the same column names - When you want to keep both the original and new features with the same name - When you need to track the source of features in the column names
If suffix is not provided and there are duplicate column names: - The method will raise a ValueError - You must either provide a suffix or rename the columns before adding
Examples:
>>> container = PsmContainer(...)
>>> # Adding features without suffix (no conflicts)
>>> features_df1 = pd.DataFrame({
... 'scan': [1, 2, 3],
... 'feature1': [0.1, 0.2, 0.3],
... 'feature2': [0.4, 0.5, 0.6]
... })
>>> container.add_features(
... features_df1,
... psms_key='scan',
... feature_key='scan',
... source='source1'
... )
>>> # Adding features with suffix (handling conflicts)
>>> features_df2 = pd.DataFrame({
... 'scan': [1, 2, 3],
... 'score': [0.8, 0.9, 0.7], # This would conflict with existing 'score'
... 'feature3': [0.7, 0.8, 0.9]
... })
>>> container.add_features(
... features_df2,
... psms_key='scan',
... feature_key='scan',
... source='source2',
... suffix='_new' # 'score' becomes 'score_new'
... )
Source code in optimhc/psm_container.py
481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 | |
add_features_by_index(features_df, source, suffix=None)
¶
Merge new features into the PSM DataFrame based on the DataFrame index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
features_df
|
DataFrame
|
DataFrame containing new features to add. |
required |
source
|
str
|
Name of the source of the new features. |
required |
suffix
|
str
|
Suffix to add to the new columns if there's a name conflict. |
None
|
Source code in optimhc/psm_container.py
add_results(results_df, psms_key, result_key)
¶
Add results of rescore engine to the PSM DataFrame based on specified columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
results_df
|
DataFrame
|
DataFrame containing new results to add. |
required |
psms_key
|
str or list of str
|
Column name(s) in the PSM data to merge on. |
required |
result_key
|
str or list of str
|
Column name(s) in the results data to merge on. |
required |
Source code in optimhc/psm_container.py
write_pin(output_file, style='default', source=None)
¶
Write the PSM data to a Percolator PIN file, supporting both generic Percolator and MSBooster-compatible formats. The style parameter is actually used to output a unified pin format file to benchmark the performance of different rescoring methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
output_file
|
str
|
Path to the output PIN file. |
required |
style
|
str
|
If set to 'msbooster', outputs only the columns required by MSBooster (SpecId, Label, ScanNr, retentiontime, rank, hyperscore or log10_evalue, Peptide, Proteins).
If set to 'default', outputs all features specified in |
'default'
|
source
|
list of str
|
List of feature sources to include. If None, includes all sources. |
None
|
Returns:
| Type | Description |
|---|---|
DataFrame
|
The DataFrame written to the PIN file. |
Notes
- The first three columns are always: SpecId, Label, ScanNr.
- For 'msbooster' style, the columns are: SpecId, Label, ScanNr, retentiontime, rank, hyperscore or log10_evalue, Peptide, Proteins.
- If hit_rank_column is not specified, rank is set to 1 for all rows.
- Either 'hyperscore' or 'expect' must be present in features; for 'expect', the column is written as 'log10_evalue'.
- The 'log10_evalue' column should contain the base-10 logarithm of the e-value.
- The 'Peptide' column is formatted with underscores (e.g., _.PEPTIDE._).
- For the 'default' style, all features from rescoring_features are appended between the ScanNr and Peptide columns.
- The 'Proteins' column is a semicolon-separated list if stored as a list or tuple.
- The Label column is converted to 1 (target) and -1 (decoy), as required by Percolator.
Example output (default style): SpecId Label ScanNr feature1 feature2 ... Peptide Proteins
Example output (msbooster style): SpecId Label ScanNr retentiontime rank hyperscore Peptide Proteins
or: SpecId Label ScanNr retentiontime rank log10_evalue Peptide Proteins
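The conventions in these notes can be illustrated with a small stand-alone sketch. It does not call write_pin; the helper format_pin_row and the dictionary keys it reads are hypothetical, and the choice to prefer 'hyperscore' over 'expect' when both are present is an assumption.

```python
import math


def format_pin_row(psm, hit_rank=None):
    """Format one PSM (a plain dict) as an msbooster-style PIN row.

    Hypothetical helper: the keys of `psm` are assumptions for illustration.
    """
    # Percolator expects 1 for targets and -1 for decoys.
    label = 1 if psm["is_target"] else -1
    # Rank falls back to 1 when no hit rank is available.
    rank = 1 if hit_rank is None else hit_rank
    # Use hyperscore if present, otherwise log10 of the e-value.
    if "hyperscore" in psm:
        score = psm["hyperscore"]
    elif "expect" in psm:
        score = math.log10(psm["expect"])
    else:
        raise ValueError("Either 'hyperscore' or 'expect' is required.")
    # Lists or tuples of accessions become a semicolon-separated string.
    proteins = psm["proteins"]
    if isinstance(proteins, (list, tuple)):
        proteins = ";".join(proteins)
    return [
        psm["spec_id"], label, psm["scan"], psm["retention_time"],
        rank, score, f"_.{psm['peptide']}._", proteins,
    ]


example = {
    "spec_id": "run1.100.100.2",
    "is_target": False,
    "scan": 100,
    "retention_time": 35.2,
    "expect": 0.01,
    "peptide": "PEPTIDE",
    "proteins": ["P1", "P2"],
}
row = format_pin_row(example)
```

Here the decoy label becomes -1, the missing rank becomes 1, and the e-value is written on a log10 scale.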
Raises:
| Type | Description |
|---|---|
ValueError
|
If required columns are missing for the selected style. |
Source code in optimhc/psm_container.py
OverlappingPeptideFeatureGenerator(peptides, min_overlap_length=6, min_length=7, max_length=60, min_entropy=0, fill_missing='median', remove_pre_nxt_aa=False, remove_modification=True, *args, **kwargs)
¶
Bases: BaseFeatureGenerator
Generates features based on peptide sequence overlaps using the Overlap-Layout-Consensus (OLC) algorithm.
This generator constructs an overlap graph of peptides, removes transitive edges, simplifies the graph to contigs, and computes features such as the number of overlaps, log-transformed overlap counts, overlap ranks, and contig lengths. Before processing, it filters out peptides with low entropy or outlier lengths. It also records detailed information about brother peptides and contigs, accessible via the get_full_data method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
peptides
|
list of str
|
List of peptide sequences. |
required |
min_overlap_length
|
int
|
Minimum required overlap length for peptides to be considered overlapping. Default is 6. |
6
|
min_length
|
int
|
Minimum peptide length to include in processing. Default is 7. |
7
|
max_length
|
int
|
Maximum peptide length to include in processing. Default is 60. |
60
|
min_entropy
|
float
|
Minimum Shannon entropy for peptides to include in processing. Default is 0. |
0
|
fill_missing
|
str
|
Method to fill missing values for filtered peptides. Options are 'median' or 'zero'. Default is 'median'. |
'median'
|
remove_pre_nxt_aa
|
bool
|
Whether to remove the preceding and following amino acids from peptides. Default is False. |
False
|
remove_modification
|
bool
|
Whether to remove modifications from peptides. Default is True. |
True
|
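As a rough illustration of the length and entropy filters controlled by min_length, max_length, and min_entropy, the sketch below computes Shannon entropy over residue frequencies. Measuring entropy in bits and treating all bounds as inclusive are assumptions; the function names are hypothetical and not part of the library.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the residue distribution of a sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def keep_peptide(peptide, min_length=7, max_length=60, min_entropy=0.0):
    """Return True if the peptide passes the length and entropy filters."""
    return (
        min_length <= len(peptide) <= max_length
        and shannon_entropy(peptide) >= min_entropy
    )
```

With the default min_entropy of 0, no peptide is rejected on entropy alone; raising it removes low-complexity sequences such as homopolymers.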
Attributes:
| Name | Type | Description |
|---|---|---|
original_peptides |
list of str
|
Original list of peptide sequences. |
min_overlap_length |
int
|
Minimum required overlap length. |
min_length |
int
|
Minimum peptide length. |
max_length |
int
|
Maximum peptide length. |
min_entropy |
float
|
Minimum Shannon entropy. |
fill_missing |
str
|
Method to fill missing values. |
remove_pre_nxt_aa |
bool
|
Whether to remove preceding and following amino acids. |
remove_modification |
bool
|
Whether to remove modifications. |
filtered_peptides |
list of str
|
List of peptides after filtering. |
filtered_indices |
list of int
|
Indices of filtered peptides. |
peptide_to_index |
dict of str to int
|
Mapping of peptides to their indices. |
overlap_data |
DataFrame
|
DataFrame containing overlap data. |
peptide_to_contig |
dict of str to int
|
Mapping of peptides to their contig indices. |
assembled_contigs |
list of dict
|
List of assembled contigs. |
full_data |
DataFrame
|
Full data including brother peptides and contig information. |
_overlap_graph |
DiGraph
|
Overlap graph. |
_simplified_graph |
DiGraph
|
Simplified graph with transitive edges removed. |
Notes
Key Data Structures:
1. contigs: List[List[str]]
- Represents non-branching paths in the overlap graph
- Each inner list contains peptide sequences that form a continuous chain
- Example: [['PEPTIDE1', 'PEPTIDE2'], ['PEPTIDE3']]
2. assembled_contigs: List[Dict]
- Contains the assembled sequences and their constituent peptides
- Each dictionary has two keys:
'sequence': The merged/assembled sequence of overlapping peptides
'peptides': List of peptides that were used to build this contig
- Example: [
{
'sequence': 'LONGPEPTIDESEQUENCE',
'peptides': ['LONGPEP', 'PEPTIDE', 'SEQUENCE']
},
{
'sequence': 'SINGLEPEPTIDE',
'peptides': ['SINGLEPEPTIDE']
}
]
3. peptide_to_contig: Dict[str, int]
- Maps each peptide to its contig index in assembled_contigs
- Key: peptide sequence
- Value: index of the contig containing this peptide
- Example: {
'LONGPEP': 0,
'PEPTIDE': 0,
'SEQUENCE': 0,
'SINGLEPEPTIDE': 1
}
4. overlap_graph (_overlap_graph): nx.DiGraph
- Directed graph representing all possible overlaps between peptides
- Nodes: peptide sequences
- Edges: overlaps between peptides
- Edge weights: length of overlap
5. simplified_graph (_simplified_graph): nx.DiGraph
- Simplified version of overlap_graph with transitive edges removed
- Used for final contig assembly
- More efficient representation of essential overlaps
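The relationship between the overlap graph and the simplified graph can be sketched without any graph library. The class itself stores nx.DiGraph objects; here plain dictionaries of edges stand in for them, and all function names are hypothetical. An edge (a, b) with weight k means that the last k residues of a equal the first k residues of b.

```python
def suffix_prefix_overlap(a, b, min_overlap):
    """Longest suffix of `a` that is a prefix of `b`, or 0 if below min_overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if k > 0 and a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_edges(peptides, min_overlap):
    """All directed overlaps between distinct peptides, keyed by (source, target)."""
    edges = {}
    for a in peptides:
        for b in peptides:
            if a != b:
                k = suffix_prefix_overlap(a, b, min_overlap)
                if k:
                    edges[(a, b)] = k
    return edges


def remove_transitive_edges(edges):
    """Drop edge (a, c) whenever a -> b and b -> c both exist for some b."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    reduced = {}
    for (a, c), k in edges.items():
        implied = any(
            c in successors.get(b, ()) for b in successors[a] if b != c
        )
        if not implied:
            reduced[(a, c)] = k
    return reduced


peptides = ["ABCDEFG", "CDEFGHI", "EFGHIJK"]
edges = build_overlap_edges(peptides, min_overlap=3)
simplified = remove_transitive_edges(edges)
```

In this toy input the direct edge from the first to the third peptide is implied by the two longer overlaps, so only the chain of two edges remains for contig assembly.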
Source code in optimhc/feature/overlapping_peptide.py
id_column
property
¶
Returns a list of input columns required for the feature generator.
Returns: List[str]: List of input columns.
feature_columns
property
¶
Returns the feature column names.
overlap_graph
property
¶
Returns the overlap graph.
simplified_graph
property
¶
Returns the simplified (layout) graph, i.e. the overlap graph with transitive edges removed.
contigs
property
¶
Returns the assembled contigs.
generate_features()
¶
Generates features for peptide overlaps, including the count of overlapping peptides, contig length, and log-transformed counts and ranks.
Returns: pd.DataFrame: DataFrame containing the features.
Source code in optimhc/feature/overlapping_peptide.py
get_full_data()
¶
Returns the full data including brother peptides and contig information for each peptide. In the output, the lists of contig peptides and brother peptides include redundant peptides, so that their counts match the corresponding peptide and contig_member_count.
Returns: pd.DataFrame: DataFrame containing peptides and their brother peptides and contigs.
Source code in optimhc/feature/overlapping_peptide.py
assign_brother_aggregated_feature(psms, feature_columns, overlapping_source, source_name='OverlappingGroupFeatures')
¶
Assign aggregated features based on brother peptides to the PSMs.
For PSMs with the same ContigSequence (brother peptides), compute the mean of specified features and assign these aggregated features back to each PSM in the group. Additionally, compute the sum as mean * (contig_member_count + 1). If a PSM does not have a ContigSequence (no brothers), its new features will be set to the original values.
Parameters:
- psms (PsmContainer): PSM container containing the peptides and features.
- feature_columns (Union[str, List[str]]): Name of the feature column(s) to aggregate.
- overlapping_source (str): Source name of the overlapping peptide features.
- source_name (str): Name of the new feature source.
Returns: None
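The aggregation rule described above can be sketched on plain dictionaries. This is an illustration only: the row keys ("contig", "contig_member_count") and the output column suffixes are hypothetical, and the real function operates on a PsmContainer in place.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_contig(rows, feature):
    """Attach per-contig mean and sum of `feature` to each row.

    Rows without a contig keep their original value for both aggregates.
    """
    groups = defaultdict(list)
    for row in rows:
        if row["contig"] is not None:
            groups[row["contig"]].append(row[feature])

    result = []
    for row in rows:
        if row["contig"] is None:
            agg_mean = agg_sum = row[feature]
        else:
            agg_mean = mean(groups[row["contig"]])
            # Sum is reconstructed as mean * (contig_member_count + 1).
            agg_sum = agg_mean * (row["contig_member_count"] + 1)
        result.append(
            {**row, f"{feature}_mean": agg_mean, f"{feature}_sum": agg_sum}
        )
    return result


rows = [
    {"peptide": "P1", "contig": "X", "contig_member_count": 1, "score": 2.0},
    {"peptide": "P2", "contig": "X", "contig_member_count": 1, "score": 4.0},
    {"peptide": "P3", "contig": None, "contig_member_count": 0, "score": 5.0},
]
aggregated = aggregate_by_contig(rows, "score")
```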
Source code in optimhc/feature/overlapping_peptide.py
PWM¶
pwm
¶
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
¶
logger = logging.getLogger(__name__)
module-attribute
¶
c_flank_pwm_data = {'A': [0.891194, 0.599125, 0.567353], 'C': [-3.952777, -4.34522, -4.612584], 'D': [0.291173, 0.550133, 0.32569], 'E': [0.687212, 0.662834, 0.834717], 'F': [-1.250652, -0.784627, -1.139232], 'G': [0.509354, -0.36837, 0.919885], 'H': [-0.808229, -0.836628, -0.591508], 'I': [-0.046196, -0.034123, -1.56436], 'K': [0.452471, 0.665617, 1.086677], 'L': [0.522681, 0.382607, 0.208291], 'M': [-2.131278, -2.144693, -2.008653], 'N': [-0.208044, -0.699085, 0.022668], 'P': [0.417673, 1.179052, -0.01885], 'Q': [-0.033656, -0.051822, 0.274208], 'R': [0.535173, 0.397917, 0.924583], 'S': [0.542669, 0.255087, 0.360228], 'T': [-0.359617, -0.322741, -0.028565], 'V': [0.507861, 0.582748, -0.584754], 'W': [-3.432776, -3.024952, -3.391619], 'Y': [-1.1447, -0.725718, -1.221789], 'X': [0.0, 0.0, 0.0]}
module-attribute
¶
c_flank_pwm = pd.DataFrame.from_dict(c_flank_pwm_data, orient='index', columns=['Pos1', 'Pos2', 'Pos3'])
module-attribute
¶
n_flank_pwm_data = {'A': [0.672938, 0.494511, 0.290216], 'C': [-5.464582, -5.140732, -5.112201], 'D': [0.85685, 0.732683, 0.398964], 'E': [0.692225, 0.660452, 0.746641], 'F': [-1.024461, -1.529751, -0.687119], 'G': [0.873872, 0.746332, 0.630604], 'H': [-1.386627, -1.212169, -1.011943], 'I': [0.138351, -0.461093, 0.095419], 'K': [0.801095, 0.639492, 0.847272], 'L': [0.56242, -0.162599, 0.201511], 'M': [-2.230132, -2.754557, -2.397489], 'N': [-0.198452, -0.099572, -0.214853], 'P': [-1.491966, 1.405721, 0.53271], 'Q': [-0.622442, -0.151155, -0.006518], 'R': [0.216375, 0.217991, 0.545768], 'S': [0.623057, 0.416164, 0.236042], 'T': [0.160517, 0.011151, -0.111494], 'V': [0.52378, 0.239183, 0.330202], 'W': [-3.27634, -4.050898, -2.959115], 'Y': [-1.06052, -1.689795, -0.674071], 'X': [0.0, 0.0, 0.0]}
module-attribute
¶
n_flank_pwm = pd.DataFrame.from_dict(n_flank_pwm_data, orient='index', columns=['Pos1', 'Pos2', 'Pos3'])
module-attribute
¶
BaseFeatureGenerator
¶
Bases: ABC
Abstract base class for all feature generators in the rescoring pipeline.
Subclasses must implement:
- feature_columns -- names of generated feature columns
- id_column -- merge key column(s)
- generate_features() -- pure computation, returns a DataFrame
- from_config() -- construct an instance from pipeline config
The default apply() merges features by peptide column.
Override it for index-based merges, composite keys, or post-processing.
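A minimal sketch of what a subclass might look like is given below. To stay self-contained it defines a stand-in abstract class with the same four members instead of importing the real BaseFeatureGenerator; the generator name, the "Peptide" key column, and the use of a list of dicts in place of a DataFrame are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class SketchBaseGenerator(ABC):
    """Stand-in for the abstract interface described above (not the real class)."""

    @property
    @abstractmethod
    def feature_columns(self): ...

    @property
    @abstractmethod
    def id_column(self): ...

    @abstractmethod
    def generate_features(self): ...

    @classmethod
    @abstractmethod
    def from_config(cls, psms, config, params): ...


class PeptideLengthGenerator(SketchBaseGenerator):
    """Hypothetical generator producing a single peptide-length feature."""

    def __init__(self, peptides):
        self.peptides = list(peptides)

    @property
    def feature_columns(self):
        return ["peptide_length"]

    @property
    def id_column(self):
        return ["Peptide"]

    def generate_features(self):
        # Pure computation: one record per peptide, keyed by the id column.
        return [{"Peptide": p, "peptide_length": len(p)} for p in self.peptides]

    @classmethod
    def from_config(cls, psms, config, params):
        # `psms.peptides` mirrors the PsmContainer attribute of that name.
        return cls(psms.peptides)


fake_psms = SimpleNamespace(peptides=["PEPTIDE", "SEQUENCE"])
generator = PeptideLengthGenerator.from_config(fake_psms, config={}, params={})
features = generator.generate_features()
```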
feature_columns
abstractmethod
property
¶
Return a list of feature column names produced by this generator.
id_column
abstractmethod
property
¶
Return the column(s) used as merge key(s) with the PsmContainer.
generate_features()
abstractmethod
¶
from_config(psms, config, params)
classmethod
¶
Construct a generator instance from pipeline configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
psms
|
PsmContainer
|
The PSM container with all current data. |
required |
config
|
dict
|
The full pipeline configuration. |
required |
params
|
dict
|
Generator-specific parameters from
|
required |
Source code in optimhc/feature/base_feature_generator.py
apply(psms, source)
¶
Generate features and merge them into the PsmContainer.
The default implementation merges by peptide column using
add_features(). Override for different merge strategies
(index-based, composite key) or additional post-processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
psms
|
PsmContainer
|
The PSM container to add features to (modified in-place). |
required |
source
|
str
|
Name of this feature source (e.g. |
required |
Source code in optimhc/feature/base_feature_generator.py
PWMFeatureGenerator(peptides, alleles, anchors=2, mhc_class='I', pwm_path=None, remove_pre_nxt_aa=False, remove_modification=True, *args, **kwargs)
¶
Bases: BaseFeatureGenerator
Generates PWM (Position Weight Matrix) features for peptides based on specified MHC alleles.
This generator calculates PWM scores for each peptide against the provided MHC class I or II allele PWMs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
peptides
|
list of str
|
Series of peptide sequences. |
required |
alleles
|
list of str
|
List of MHC allele names (e.g., ['HLA-A01:01', 'HLA-B07:02']). |
required |
anchors
|
int
|
Number of anchor positions to consider for MHC class I. Default is 2. |
2
|
mhc_class
|
str
|
MHC class, either 'I' or 'II'. Default is 'I'. |
'I'
|
pwm_path
|
str or PathLike
|
Custom path to PWM files. Defaults to '../../data/PWMs'. |
None
|
remove_pre_nxt_aa
|
bool
|
Whether to remove the preceding and following amino acids from peptides. Default is False. |
False
|
remove_modification
|
bool
|
Whether to remove modifications from peptides. Default is True. |
True
|
Attributes:
| Name | Type | Description |
|---|---|---|
peptides |
Series
|
Series of peptide sequences. |
alleles |
list of str
|
List of MHC allele names. |
mhc_class |
str
|
MHC class ('I' or 'II'). |
pwm_path |
str or PathLike
|
Path to PWM files. |
pwms |
dict
|
Dictionary of PWMs for each allele and mer length. |
anchors |
int
|
Number of anchor positions for MHC class I. |
remove_pre_nxt_aa |
bool
|
Whether to remove pre/post neighboring amino acids. |
remove_modification |
bool
|
Whether to remove modifications. |
Notes
For MHC class I:
- Generates 'PWM_Score_{allele}' and optionally 'Anchor_Score_{allele}' columns.
For MHC class II:
- Generates 'PWM_Score_{allele}' (core 9-mer), 'N_Flank_PWM_Score_{allele}', and 'C_Flank_PWM_Score_{allele}' columns.
Source code in optimhc/feature/pwm.py
id_column
property
¶
Get a list of input columns required for the feature generator.
Returns:
| Type | Description |
|---|---|
list of str
|
List of column names required for feature generation. |
feature_columns
property
¶
Returns a list of feature names generated by the feature generator.
set_pwms(pwms)
¶
Set PWMs directly, allowing for custom PWMs to be provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pwms
|
dict of str to dict of int to pd.DataFrame
|
Dictionary of PWMs for each allele and mer length. Format: {allele: {mer_length: pwm_dataframe}} |
required |
Source code in optimhc/feature/pwm.py
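The nested {allele: {mer_length: pwm}} layout can be illustrated as follows. In this sketch a plain dict mapping each residue to a list of per-position scores stands in for the pd.DataFrame, and the score is taken as the sum of the per-position entries. Both the matrix orientation and the scoring rule are assumptions, the values are toy numbers, and score_peptide is a hypothetical helper.

```python
# Toy PWM for 2-mers: residue -> score at position 0 and position 1.
toy_pwm = {
    "A": [1.0, -1.0],
    "C": [0.5, 2.0],
}

# Same nesting as described above: allele -> peptide length -> matrix.
pwms = {"HLA-A02:01": {2: toy_pwm}}


def score_peptide(peptide, allele, pwms):
    """Sum per-position PWM entries; None if no matrix exists for this length."""
    pwm = pwms.get(allele, {}).get(len(peptide))
    if pwm is None:
        return None
    return sum(pwm[residue][pos] for pos, residue in enumerate(peptide))


score = score_peptide("AC", "HLA-A02:01", pwms)
missing = score_peptide("AAC", "HLA-A02:01", pwms)
```

Keying the inner dictionary by peptide length lets one allele carry separate matrices for 8-mers, 9-mers, and so on.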
generate_features()
¶
Generate PWM features for all peptides across specified alleles.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing generated features: For MHC class I: - 'PWM_Score_{allele}' and optionally 'Anchor_Score_{allele}' columns. For MHC class II: - 'PWM_Score_{allele}' (core 9-mer), - 'N_Flank_PWM_Score_{allele}', - 'C_Flank_PWM_Score_{allele}' columns. |
Notes
Missing values are imputed with the median value for each feature.
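Median imputation of a single feature column can be sketched in plain Python. Representing missing values as None and the helper name impute_median are assumptions; the generator itself works on DataFrame columns.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to compute a median from; leave the column untouched.
        return list(values)
    fill = median(observed)
    return [fill if v is None else v for v in values]


imputed = impute_median([1.0, None, 3.0, 10.0])
```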
Source code in optimhc/feature/pwm.py
MHCflurry¶
mhcflurry
¶
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
¶
logger = logging.getLogger(__name__)
module-attribute
¶
MHCflurryFeatureGenerator(peptides, alleles, remove_pre_nxt_aa=False, remove_modification=True, *args, **kwargs)
¶
Bases: BaseFeatureGenerator
Generate MHCflurry features for peptides based on specified MHC class I alleles.
This generator calculates MHCflurry presentation scores for each peptide against the provided MHC class I alleles.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
peptides
|
List[str]
|
List of peptide sequences. |
required |
alleles
|
List[str]
|
List of MHC allele names (e.g., ['HLA-A01:01', 'HLA-B07:02']). |
required |
remove_pre_nxt_aa
|
bool
|
Whether to remove the preceding and following amino acids from peptides. Default is False. |
False
|
remove_modification
|
bool
|
Whether to remove modifications from peptides. Default is True. |
True
|
Notes
The generated features include:
- mhcflurry_affinity: Binding affinity score
- mhcflurry_processing_score: Processing score
- mhcflurry_presentation_score: Presentation score
- mhcflurry_presentation_percentile: Presentation percentile
Source code in optimhc/feature/mhcflurry.py
feature_columns
property
¶
Return the list of generated feature column names.
Returns:
| Type | Description |
|---|---|
List[str]
|
List of feature column names: - mhcflurry_affinity - mhcflurry_processing_score - mhcflurry_presentation_score - mhcflurry_presentation_percentile |
id_column
property
¶
Return the list of input columns required for the feature generator.
Returns:
| Type | Description |
|---|---|
List[str]
|
List of input column names. |
raw_predictions
property
¶
Return the raw predictions DataFrame.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing the raw predictions: - peptide: Cleaned peptide sequence - allele: MHC allele - affinity: Binding affinity - processing_score: Processing score - presentation_score: Presentation score - presentation_percentile: Presentation percentile |
get_raw_predictions()
¶
Get the raw prediction results DataFrame from MHCflurry.
Returns:
| Type | Description |
|---|---|
DataFrame
|
Raw prediction results DataFrame containing: - peptide: Cleaned peptide sequence - allele: MHC allele - affinity: Binding affinity - processing_score: Processing score - presentation_score: Presentation score - presentation_percentile: Presentation percentile |
Source code in optimhc/feature/mhcflurry.py
save_raw_predictions(file_path, **kwargs)
¶
Save the raw prediction results to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Path to save the file. |
required |
**kwargs
|
dict
|
Additional parameters passed to pandas.DataFrame.to_csv. If 'index' is not specified, it defaults to False. |
{}
|
Notes
This method saves the raw predictions DataFrame to a CSV file. The DataFrame includes:
- peptide: Cleaned peptide sequence
- allele: MHC allele
- affinity: Binding affinity
- processing_score: Processing score
- presentation_score: Presentation score
- presentation_percentile: Presentation percentile
Source code in optimhc/feature/mhcflurry.py
generate_features()
¶
Generate MHCflurry features for the provided peptides and alleles.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing the peptides and their predicted MHCflurry features: - Peptide: Original peptide sequence - mhcflurry_affinity: Binding affinity - mhcflurry_processing_score: Processing score - mhcflurry_presentation_score: Presentation score - mhcflurry_presentation_percentile: Presentation percentile |
Notes
This method:
1. Runs MHCflurry predictions
2. Renames columns to include 'mhcflurry_' prefix
3. Fills missing values with median values
4. Returns the final feature DataFrame
Source code in optimhc/feature/mhcflurry.py
get_best_allele()
¶
Get the best allele for each peptide.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing the best alleles for the peptides: - Peptide: Original peptide sequence - mhcflurry_best_allele: Best binding allele |
Notes
The best allele is determined by the lowest presentation percentile rank.
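The selection rule can be sketched on raw prediction records shaped like the columns listed for raw_predictions. The function name is hypothetical, and the tie-breaking behaviour (first record wins) is an assumption.

```python
def best_allele_per_peptide(predictions):
    """Map each peptide to the allele with the lowest presentation percentile."""
    best = {}
    for row in predictions:
        current = best.get(row["peptide"])
        if (
            current is None
            or row["presentation_percentile"] < current["presentation_percentile"]
        ):
            best[row["peptide"]] = row
    return {peptide: row["allele"] for peptide, row in best.items()}


predictions = [
    {"peptide": "PEPTIDEA", "allele": "HLA-A01:01", "presentation_percentile": 2.5},
    {"peptide": "PEPTIDEA", "allele": "HLA-B07:02", "presentation_percentile": 0.4},
    {"peptide": "PEPTIDEB", "allele": "HLA-A01:01", "presentation_percentile": 1.0},
]
best_alleles = best_allele_per_peptide(predictions)
```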
Source code in optimhc/feature/mhcflurry.py
predictions_to_dataframe()
¶
Convert the predictions to a DataFrame.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing the predictions. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no predictions are available. |
Source code in optimhc/feature/mhcflurry.py
NetMHCpan¶
netmhcpan
¶
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
¶
logger = logging.getLogger(__name__)
module-attribute
¶
NetMHCpanFeatureGenerator(peptides, alleles, mode='best', remove_pre_nxt_aa=False, remove_modification=True, n_processes=1, show_progress=False, *args, **kwargs)
¶
Bases: BaseFeatureGenerator
Generate NetMHCpan features for peptides based on specified MHC class I alleles.
This generator calculates NetMHCpan binding predictions for each peptide against the provided MHC class I alleles.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
peptides
|
List[str]
|
List of peptide sequences. |
required |
alleles
|
List[str]
|
List of MHC allele names (e.g., ['HLA-A02:01', 'HLA-B07:02']). |
required |
mode
|
str
|
Mode of feature generation. Options: - 'best': Return only the best allele information for each peptide. - 'all': Return predictions for all alleles with allele-specific suffixes plus best allele info. Default is 'best'. |
'best'
|
remove_pre_nxt_aa
|
bool
|
Whether to remove the preceding and following amino acids from peptides. Default is False. |
False
|
remove_modification
|
bool
|
Whether to remove modifications from peptides. Default is True. |
True
|
n_processes
|
int
|
Number of processes to use for multiprocessing. Default is 1 (no multiprocessing). |
1
|
show_progress
|
bool
|
Whether to display a progress bar. Default is False. |
False
|
Notes
The generated features include:
- netmhcpan_score: Raw binding score
- netmhcpan_affinity: Binding affinity in nM
- netmhcpan_percentile_rank: Percentile rank of the binding score
Source code in optimhc/feature/netmhcpan.py
feature_columns
property
¶
Return the list of generated feature column names, determined by the mode. Only includes numerical features, excluding any string features like allele names.
Returns:
| Type | Description |
|---|---|
List[str]
|
List of feature column names: - For 'all' mode: netmhcpan_score_{allele}, netmhcpan_affinity_{allele}, netmhcpan_percentile_rank_{allele} for each allele - For both modes: netmhcpan_best_score, netmhcpan_best_affinity, netmhcpan_best_percentile_rank |
id_column
property
¶
Return the list of input columns required for the feature generator.
Returns:
| Type | Description |
|---|---|
List[str]
|
List of input column names. |
raw_predictions
property
¶
Return the raw prediction results from NetMHCpan.
Returns:
| Type | Description |
|---|---|
DataFrame
|
Raw prediction results DataFrame containing: - peptide: Cleaned peptide sequence - allele: MHC allele - score: Raw binding score - affinity: Binding affinity in nM - percentile_rank: Percentile rank |
get_raw_predictions()
¶
Get the raw prediction results DataFrame from NetMHCpan.
Returns:
| Type | Description |
|---|---|
DataFrame
|
Raw prediction results DataFrame containing: - peptide: Cleaned peptide sequence - allele: MHC allele - score: Raw binding score - affinity: Binding affinity in nM - percentile_rank: Percentile rank |
Source code in optimhc/feature/netmhcpan.py
save_raw_predictions(file_path, **kwargs)
¶
Save the raw prediction results to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Path to save the file. |
required |
**kwargs
|
dict
|
Additional parameters passed to pandas.DataFrame.to_csv. If 'index' is not specified, it defaults to False. |
{}
|
Notes
This method saves the raw predictions DataFrame to a CSV file. The DataFrame includes:
- peptide: Cleaned peptide sequence
- allele: MHC allele
- score: Raw binding score
- affinity: Binding affinity in nM
- percentile_rank: Percentile rank
Source code in optimhc/feature/netmhcpan.py
generate_features()
¶
Generate the final feature table with NetMHCpan features for each peptide.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing peptides and their predicted features: - Peptide: Original peptide sequence - For 'all' mode: netmhcpan_score_{allele}, netmhcpan_affinity_{allele}, netmhcpan_percentile_rank_{allele} for each allele - For both modes: netmhcpan_best_score, netmhcpan_best_affinity, netmhcpan_best_percentile_rank |
Notes
The features generated depend on the mode:
- 'best': Only the best allele information for each peptide
- 'all': All allele predictions plus best allele information
Missing values are handled consistently by filling with median values for numeric columns.
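The difference between the two modes is mostly a matter of column layout, which the following sketch illustrates with plain dictionaries in place of a DataFrame. Choosing the best allele by the lowest percentile rank is an assumption (the documentation does not state the criterion), and build_features is a hypothetical name.

```python
METRICS = ("score", "affinity", "percentile_rank")


def build_features(predictions, mode="best"):
    """Build one feature record per peptide from per-allele predictions."""
    by_peptide = {}
    for row in predictions:
        by_peptide.setdefault(row["peptide"], []).append(row)

    table = []
    for peptide, rows in by_peptide.items():
        features = {"Peptide": peptide}
        if mode == "all":
            # One column per metric and allele, suffixed with the allele name.
            for row in rows:
                for name in METRICS:
                    features[f"netmhcpan_{name}_{row['allele']}"] = row[name]
        # Best-allele columns appear in both modes.
        best = min(rows, key=lambda r: r["percentile_rank"])  # assumed criterion
        for name in METRICS:
            features[f"netmhcpan_best_{name}"] = best[name]
        table.append(features)
    return table


predictions = [
    {"peptide": "PEPTIDEA", "allele": "HLA-A02:01",
     "score": 0.8, "affinity": 50.0, "percentile_rank": 0.3},
    {"peptide": "PEPTIDEA", "allele": "HLA-B07:02",
     "score": 0.1, "affinity": 9000.0, "percentile_rank": 20.0},
]
best_only = build_features(predictions, mode="best")
all_alleles = build_features(predictions, mode="all")
```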
Source code in optimhc/feature/netmhcpan.py
predictions_to_dataframe()
¶
Convert the predictions to a DataFrame.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing the predictions. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no predictions are available. |
Source code in optimhc/feature/netmhcpan.py
_predict_peptide_chunk(peptides_chunk, alleles)
¶
Predict NetMHCpan scores for a chunk of peptides.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
peptides_chunk
|
List[str]
|
List of peptide sequences. |
required |
alleles
|
List[str]
|
List of MHC allele names. |
required |
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing predictions: - peptide: Peptide sequence - allele: MHC allele - score: Raw binding score - affinity: Binding affinity in nM - percentile_rank: Percentile rank |
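When n_processes is greater than 1, the peptide list has to be divided into chunks before each chunk is handed to a worker. How the library splits the list is not documented here; the sketch below shows one straightforward way to produce nearly equal chunks, with split_into_chunks as a hypothetical helper.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, nearly equal parts."""
    n_chunks = max(1, min(n_chunks, len(items) or 1))
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


chunks = split_into_chunks(["P1", "P2", "P3", "P4", "P5", "P6", "P7"], 3)
```

Each chunk could then be passed, together with the allele list, to a per-chunk prediction function, and the resulting tables concatenated.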
Source code in optimhc/feature/netmhcpan.py
NetMHCIIpan¶
netmhciipan
¶
feature_generator_factory = FeatureGeneratorFactory()
module-attribute
¶
logger = logging.getLogger(__name__)
module-attribute
¶
NetMHCIIpanFeatureGenerator(peptides, alleles, mode='best', remove_pre_nxt_aa=True, remove_modification=True, n_processes=1, show_progress=False, *args, **kwargs)
¶
Bases: BaseFeatureGenerator
Generate NetMHCIIpan features for given peptides based on specified MHC Class II alleles.
This feature generator uses the NetMHCIIpan43_BA interface to predict MHC Class II binding for each peptide and returns scores and features based on the specified parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
peptides
|
List[str]
|
List of peptide sequences. |
required |
alleles
|
List[str]
|
List of MHC Class II alleles, e.g., ['DRB1_0101', 'DRB1_0102']. |
required |
mode
|
str
|
Feature generation mode. Options: - 'best': Return only the best result for each peptide across all alleles. - 'all': Return prediction results for each peptide across all alleles (with allele-specific column suffixes). Default is 'best'. |
'best'
|
remove_pre_nxt_aa
|
bool
|
Whether to remove the amino acids flanking the peptide (e.g., removing X-AA/AA-X forms). Default is True. |
True
|
remove_modification
|
bool
|
Whether to remove modification information from peptides, e.g., (Phospho). Default is True. |
True
|
n_processes
|
int
|
Number of processes to use. Default is 1 (no multiprocessing). |
1
|
show_progress
|
bool
|
Whether to display a progress bar. Default is False. |
False
|
Notes
The generated features include:
- netmhciipan_score: Raw binding score
- netmhciipan_affinity: Binding affinity in nM
- netmhciipan_percentile_rank: Percentile rank of the binding score
Source code in optimhc/feature/netmhciipan.py
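The remove_pre_nxt_aa and remove_modification options describe a peptide-cleaning step that runs before prediction. A minimal sketch of such a step follows. The input formats are assumptions, not taken from the optimhc source: flanking residues are assumed to be written as "K.PEPTIDE.R", and modifications are assumed to appear in parentheses or square brackets. The helper name clean_peptide is hypothetical.

```python
import re

# Assumed formats: flanking residues as "K.PEPTIDE.R" (a "-" may mark a
# terminus), modifications in parentheses or square brackets.
FLANKING = re.compile(r"^[A-Za-z\-]\.(.+)\.[A-Za-z\-]$")
MODIFICATION = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def clean_peptide(peptide, remove_pre_nxt_aa=True, remove_modification=True):
    """Return the bare sequence according to the two cleaning flags."""
    if remove_pre_nxt_aa:
        match = FLANKING.match(peptide)
        if match:
            # Keep only the part between the flanking residues.
            peptide = match.group(1)
    if remove_modification:
        # Drop every bracketed or parenthesized annotation.
        peptide = MODIFICATION.sub("", peptide)
    return peptide


cleaned = clean_peptide("K.PEPS(Phospho)TIDE.R")
```

Under these assumptions, "K.PEPS(Phospho)TIDE.R" is reduced to "PEPSTIDE", while a peptide that is already bare passes through unchanged.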
feature_columns
property
¶
Return the list of generated feature column names, determined by the mode. Only includes numerical features, excluding any string features like allele names.
Returns:
| Type | Description |
|---|---|
List[str]
|
List of feature column names. In 'all' mode: netmhciipan_score_{allele}, netmhciipan_affinity_{allele}, and netmhciipan_percentile_rank_{allele} for each allele. In both modes: netmhciipan_best_score, netmhciipan_best_affinity, and netmhciipan_best_percentile_rank. |
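The naming scheme described above can be sketched as a small function. The column names follow the documented patterns. The ordering (per-allele columns first, then the best columns) and the function name build_feature_columns are assumptions.

```python
def build_feature_columns(alleles, mode="best"):
    """List numeric feature column names for the given mode."""
    metrics = ["score", "affinity", "percentile_rank"]
    columns = []
    if mode == "all":
        # One column per metric and allele, suffixed with the allele name.
        for allele in alleles:
            columns += [f"netmhciipan_{metric}_{allele}" for metric in metrics]
    # The best-allele summary columns are present in both modes.
    columns += [f"netmhciipan_best_{metric}" for metric in metrics]
    return columns


best_columns = build_feature_columns(["DRB1_0101", "DRB1_0102"], mode="best")
all_columns = build_feature_columns(["DRB1_0101", "DRB1_0102"], mode="all")
```

With two alleles, 'best' mode yields three columns and 'all' mode yields nine (six allele-specific plus the three best columns).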
id_column
property
¶
Return the list of input columns required for the feature generator.
Returns:
| Type | Description |
|---|---|
List[str]
|
List of input column names. |
raw_predictions
property
¶
Return the raw prediction results from NetMHCIIpan.
Returns:
| Type | Description |
|---|---|
DataFrame
|
Raw prediction results DataFrame with the columns peptide (cleaned peptide sequence), allele (MHC allele), score (raw binding score), affinity (binding affinity in nM), and percentile_rank (percentile rank). |
generate_features()
¶
Generate the final feature table with NetMHCIIpan features for each peptide.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing peptides and their predicted features. Peptide: original peptide sequence. In 'all' mode: netmhciipan_score_{allele}, netmhciipan_affinity_{allele}, and netmhciipan_percentile_rank_{allele} for each allele. In both modes: netmhciipan_best_score, netmhciipan_best_affinity, and netmhciipan_best_percentile_rank. |
Notes
The features generated depend on the mode:
- 'best': Only the best allele information for each peptide
- 'all': All allele predictions plus best allele information
Missing values in numeric columns are filled with the column median.
Source code in optimhc/feature/netmhciipan.py
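Two steps in generate_features() lend themselves to a short illustration: choosing the best allele per peptide and filling missing numeric values with the median. The sketch below uses plain Python records instead of DataFrames. It assumes that "best" means the lowest percentile rank, which the documentation does not state; the helper names are hypothetical.

```python
from statistics import median


def best_per_peptide(predictions):
    """Keep, for each peptide, the row with the lowest percentile rank."""
    best = {}
    for row in predictions:
        current = best.get(row["peptide"])
        # Assumption: a lower percentile rank means a stronger predicted binder.
        if current is None or row["percentile_rank"] < current["percentile_rank"]:
            best[row["peptide"]] = row
    return best


def fill_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to compute a median from
    fill = median(observed)
    return [fill if v is None else v for v in values]


predictions = [
    {"peptide": "AAAK", "allele": "DRB1_0101", "percentile_rank": 12.0},
    {"peptide": "AAAK", "allele": "DRB1_0102", "percentile_rank": 1.5},
    {"peptide": "CCCR", "allele": "DRB1_0101", "percentile_rank": 5.0},
]
best = best_per_peptide(predictions)
filled = fill_with_median([1.0, None, 3.0])
```

Median filling keeps the feature matrix free of gaps without letting a few extreme predictions skew the substituted value the way a mean would.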
predictions_to_dataframe()
¶
Convert the predictions to a DataFrame.
Returns:
| Type | Description |
|---|---|
DataFrame
|
DataFrame containing the predictions. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If no predictions are available. |
Source code in optimhc/feature/netmhciipan.py
get_raw_predictions()
¶
Get the raw prediction results DataFrame from NetMHCIIpan.
Returns:
| Type | Description |
|---|---|
DataFrame
|
Raw prediction results DataFrame with the columns peptide (cleaned peptide sequence), allele (MHC allele), score (raw binding score), affinity (binding affinity in nM), and percentile_rank (percentile rank). |
Source code in optimhc/feature/netmhciipan.py
save_raw_predictions(file_path, **kwargs)
¶
Save the raw prediction results to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Path to save the file. |
required |
**kwargs
|
dict
|
Additional parameters passed to pandas.DataFrame.to_csv. If 'index' is not specified, it defaults to False. |
{}
|
Notes
This method saves the raw predictions DataFrame to a CSV file. The DataFrame includes:
- peptide: Cleaned peptide sequence
- allele: MHC allele
- score: Raw binding score
- affinity: Binding affinity in nM
- percentile_rank: Percentile rank
Source code in optimhc/feature/netmhciipan.py
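The "defaults to False unless specified" behavior for index is commonly written with dict.setdefault. The sketch below shows that pattern. It is an assumption about how the default could be applied, not the optimhc implementation, and RecordingFrame is a hypothetical stand-in that only records what would be passed to pandas.DataFrame.to_csv.

```python
class RecordingFrame:
    """Hypothetical DataFrame stand-in that records to_csv arguments."""

    def __init__(self):
        self.calls = []

    def to_csv(self, path, **kwargs):
        self.calls.append((path, kwargs))


def save_with_default_index(frame, file_path, **kwargs):
    # Only apply index=False when the caller did not choose a value.
    kwargs.setdefault("index", False)
    frame.to_csv(file_path, **kwargs)


frame = RecordingFrame()
save_with_default_index(frame, "raw.csv")
save_with_default_index(frame, "raw.tsv", sep="\t", index=True)
```

The first call forwards index=False, while the second keeps the caller's explicit index=True and passes sep through untouched.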
_predict_peptide_chunk_class2(peptides_chunk, alleles)
¶
Use NetMHCIIpan43_BA to predict a batch of peptides (MHC Class II).
Parameters:
| Name | Type | Description |
|---|---|---|
| peptides_chunk | List[str] | A batch of peptide sequences to predict. |
| alleles | List[str] | List of MHC Class II alleles, e.g., ['DRB1_0101', 'DRB1_0102']. |
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing prediction results. |
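Batch prediction implies that the peptide list is divided into chunks, for example one per worker when n_processes is greater than 1. How optimhc splits the list is not documented here, so the following is only a sketch of one even-splitting strategy; split_into_chunks is a hypothetical helper.

```python
def split_into_chunks(peptides, n_chunks):
    """Split peptides into at most n_chunks contiguous, near-equal parts."""
    # Never create more chunks than peptides, and always at least one.
    n_chunks = max(1, min(n_chunks, len(peptides)))
    size, remainder = divmod(len(peptides), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks take one extra peptide each.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(peptides[start:end])
        start = end
    return chunks


chunks = split_into_chunks(["A", "B", "C", "D", "E"], 2)
```

Each chunk could then be passed to a batch predictor together with the full allele list, and the per-chunk results concatenated afterwards.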