Basic Features
The Basic feature (name: Basic) computes simple sequence-level properties of each peptide. These features capture compositional and length characteristics that can help distinguish true identifications from random matches.
Source name: Basic
Output Columns
| Column | Description |
|---|---|
| length_diff_from_avg | Peptide length minus the average peptide length across all PSMs |
| abs_length_diff_from_avg | Absolute value of the length difference |
| unique_aa_count | Number of distinct amino acid types in the peptide |
| unique_aa_proportion | Proportion of distinct amino acid types relative to peptide length |
| shannon_entropy | Shannon entropy of the amino acid frequency distribution |
Computation
Preprocessing
Before computing features, each peptide sequence undergoes:
- Flanking amino acid removal: strips the preceding/next amino acid notation (e.g., K.PEPTIDE.R → PEPTIDE).
- Modification removal: removes bracketed mass values (e.g., PEPTM[147.035]IDE → PEPTMIDE).
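
The two preprocessing steps can be sketched with regular expressions. This is a minimal illustration, not the library's actual implementation: the function names (`strip_flanking`, `strip_modifications`, `preprocess`) are hypothetical, and the patterns assume flanking residues are written as a single character followed or preceded by a dot, and that modifications appear only as square-bracketed values.

```python
import re

# Assumed notation: one flanking character (a residue letter, '-' or '_'),
# a dot, the peptide, a dot, and one flanking character.
_FLANK = re.compile(r"^[A-Z\-_]\.(.+)\.[A-Z\-_]$")
# Assumed notation: modifications are bracketed values such as [147.035].
_MOD = re.compile(r"\[[^\]]*\]")


def strip_flanking(seq: str) -> str:
    """Return the peptide without flanking residues, if any are present."""
    match = _FLANK.match(seq)
    return match.group(1) if match else seq


def strip_modifications(seq: str) -> str:
    """Remove every bracketed modification annotation."""
    return _MOD.sub("", seq)


def preprocess(seq: str) -> str:
    """Apply flanking removal first, then modification removal."""
    return strip_modifications(strip_flanking(seq))
```

Flanking removal runs first here so that the dots inside bracketed masses are never mistaken for flanking separators; the pattern is anchored at both ends for the same reason.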
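Length Features
Let \( L \) denote the length of a peptide and \( \bar{L} \) the average peptide length across the entire dataset of \( N \) PSMs, where \( L_i \) is the peptide length of the \( i \)-th PSM:
\[ \bar{L} = \frac{1}{N} \sum_{i=1}^{N} L_i \]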
The length difference \( \delta \) and its absolute value are:
\[ \delta = L - \bar{L}, \qquad |\delta| = |L - \bar{L}| \]
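
As a rough sketch of the length features, the function below takes already-preprocessed peptide strings and returns one dictionary per PSM keyed by the output column names. The function name `length_features` and the list-of-dicts return shape are assumptions made for illustration only.

```python
from statistics import fmean


def length_features(peptides: list[str]) -> list[dict[str, float]]:
    """Compute per-PSM length differences relative to the dataset average.

    `peptides` holds one preprocessed sequence per PSM. An empty list
    raises statistics.StatisticsError because the average is undefined.
    """
    lengths = [len(p) for p in peptides]
    avg = fmean(lengths)  # average length over all N PSMs
    return [
        {
            "length_diff_from_avg": n - avg,
            "abs_length_diff_from_avg": abs(n - avg),
        }
        for n in lengths
    ]
```

Because the average is taken over the whole dataset, these two columns depend on which PSMs are scored together, unlike the per-peptide features.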
Unique Amino Acid Features
Let \( U \) denote the number of distinct amino acid types in a peptide. The unique amino acid proportion \( U_r \) is:
\[ U_r = \frac{U}{L} \]
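
A minimal sketch of the two unique amino acid features, assuming the input is a preprocessed sequence of single-letter residue codes. The function name `unique_aa_features` is hypothetical, and rejecting empty sequences is a choice made for this example.

```python
def unique_aa_features(peptide: str) -> dict[str, float]:
    """Return the distinct-residue count U and the proportion U / L."""
    if not peptide:
        raise ValueError("peptide must contain at least one residue")
    unique = len(set(peptide))  # U: number of distinct residue letters
    return {
        "unique_aa_count": unique,
        "unique_aa_proportion": unique / len(peptide),
    }
```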
Shannon Entropy
The Shannon entropy \( H \) quantifies the diversity of amino acid usage within a peptide:
\[ H = -\sum_{b \in \mathcal{A}} p_b \log_2 p_b \]
where \( \mathcal{A} \) is the set of unique amino acids in the peptide and \( p_b \) is the relative frequency of amino acid \( b \):
\[ p_b = \frac{n_b}{L} \]
with \( n_b \) being the count of amino acid \( b \) in the peptide. A peptide composed of a single repeated amino acid has \( H = 0 \). A peptide in which every residue is distinct has maximal entropy \( H = \log_2(L) \).
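
The entropy definition translates directly into a few lines of Python. The sketch below assumes a preprocessed sequence and uses base-2 logarithms, consistent with the stated maximum of \( \log_2(L) \). The function name `shannon_entropy` and the choice to return 0.0 for an empty sequence are assumptions of this example.

```python
import math
from collections import Counter


def shannon_entropy(peptide: str) -> float:
    """Shannon entropy (in bits) of the residue frequency distribution."""
    length = len(peptide)
    if length == 0:
        return 0.0  # assumed convention for an empty sequence
    counts = Counter(peptide)  # n_b for each residue b present
    # Sum -p_b * log2(p_b) over the residues that occur in the peptide.
    return sum(-(n / length) * math.log2(n / length) for n in counts.values())
```

For example, `shannon_entropy("AAAA")` evaluates to 0 (a single repeated residue), while `shannon_entropy("ACDE")` evaluates to 2 bits, which is \( \log_2(4) \) for four distinct residues.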
Configuration
No parameters are required. The Basic feature uses default preprocessing settings (remove_pre_nxt_aa: true, remove_modification: true).