Basic Features

The Basic feature (name: Basic) computes simple sequence-level properties of each peptide. These features capture compositional and length characteristics that can help distinguish true identifications from random matches.

Source name: Basic

Output Columns

Column	Description
`length_diff_from_avg`	Peptide length minus the average peptide length across all PSMs
`abs_length_diff_from_avg`	Absolute value of the length difference
`unique_aa_count`	Number of distinct amino acid types in the peptide
`unique_aa_proportion`	Proportion of distinct amino acid types relative to peptide length
`shannon_entropy`	Shannon entropy of the amino acid frequency distribution

Computation

Preprocessing

Before computing features, each peptide sequence undergoes:

Flanking amino acid removal: strips the preceding and following amino acids together with their dot separators (e.g., K.PEPTIDE.R → PEPTIDE).
Modification removal: removes bracketed modification masses (e.g., PEPTM[147.035]IDE → PEPTMIDE).
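
The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`strip_flanking`, `remove_modifications`, `preprocess`) are hypothetical and not part of the tool's API, flanking residues are assumed to use the `X.SEQUENCE.Y` notation, and modifications are assumed to appear as square-bracketed values.

```python
import re

def strip_flanking(seq: str) -> str:
    """Drop flanking residues, e.g. 'K.PEPTIDE.R' -> 'PEPTIDE' (hypothetical helper)."""
    # Assumption: one flanking character, a dot, the peptide, a dot, one flanking character.
    match = re.fullmatch(r"[^.]\.(.+)\.[^.]", seq)
    return match.group(1) if match else seq

def remove_modifications(seq: str) -> str:
    """Drop bracketed modification masses, e.g. 'M[147.035]' -> 'M' (hypothetical helper)."""
    return re.sub(r"\[[^\]]*\]", "", seq)

def preprocess(seq: str) -> str:
    """Apply both steps: flanking removal first, then modification removal."""
    return remove_modifications(strip_flanking(seq))

print(preprocess("K.PEPTIDE.R"))
print(preprocess("PEPTM[147.035]IDE"))
```

Both steps reduce the input to the bare residue string, so the features described in the following sections operate on amino acid letters only.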

Length Features

Let \( L \) denote the length of a peptide and \( \bar{L} \) the average peptide length across the entire dataset of \( N \) PSMs:

\[ \bar{L} = \frac{1}{N} \sum_{i=1}^{N} L_i \]

The length difference \( \delta \) and its absolute value are:

\[ \delta = L - \bar{L}, \quad |\delta| = |L - \bar{L}| \]
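
As a minimal sketch of the formulas above, the following Python function computes \( \delta \) and \( |\delta| \) for a list of already preprocessed peptides. The function name `length_features` and the example peptides are illustrative assumptions, not taken from the tool itself.

```python
def length_features(peptides):
    """Return (delta, abs_delta) per peptide, relative to the dataset-wide mean length."""
    lengths = [len(p) for p in peptides]
    mean_length = sum(lengths) / len(lengths)  # L-bar over all N PSMs
    return [(l - mean_length, abs(l - mean_length)) for l in lengths]

# Example peptides of length 7, 9, and 11, so the mean length is 9.
peptides = ["PEPTIDE", "ACDEFGHIK", "LLLLLLLLLLK"]
for pep, (delta, abs_delta) in zip(peptides, length_features(peptides)):
    print(pep, delta, abs_delta)
```

Because \( \bar{L} \) is computed over the whole dataset, the same peptide can receive different values when it is scored as part of different datasets.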

Unique Amino Acid Features

Let \( U \) denote the number of distinct amino acid types in a peptide. The unique amino acid proportion \( U_r \) is:

\[ U_r = \frac{U}{L} \]
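
A short Python sketch of these two quantities, assuming the peptide has already been preprocessed into a plain residue string (the function name `unique_aa_features` is hypothetical):

```python
def unique_aa_features(peptide: str):
    """Return (U, U_r): the distinct amino acid count and its proportion of the length."""
    unique_count = len(set(peptide))  # U
    return unique_count, unique_count / len(peptide)  # U_r = U / L

# PEPTIDE contains the distinct residues P, E, T, I, D, so U = 5 and L = 7.
count, proportion = unique_aa_features("PEPTIDE")
print(count, proportion)
```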

Shannon Entropy

The Shannon entropy \( H \) quantifies the diversity of amino acid usage within a peptide:

\[ H = -\sum_{b \in \mathcal{A}} p_b \log_2(p_b) \]

where \( \mathcal{A} \) is the set of distinct amino acids present in the peptide and \( p_b \) is the relative frequency of amino acid \( b \):

\[ p_b = \frac{n_b}{L} \]

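with \( n_b \) being the count of amino acid \( b \) in the peptide. A peptide composed of a single repeated amino acid has \( H = 0 \). A peptide in which every residue is distinct reaches the maximum entropy for its length, \( H = \log_2(L) \). In general, \( 0 \le H \le \log_2(U) \le \log_2(L) \), where \( U \) is the number of distinct amino acid types in the peptide.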
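
The entropy definition can be illustrated with a small Python sketch. The function name `shannon_entropy` is an assumption for this example, and the input is assumed to be a preprocessed residue string.

```python
from collections import Counter
from math import log2

def shannon_entropy(peptide: str) -> float:
    """Shannon entropy (in bits) of the amino acid frequency distribution."""
    length = len(peptide)
    counts = Counter(peptide)  # n_b for each distinct amino acid b
    # Sum of -p_b * log2(p_b) over the amino acids present in the peptide.
    return sum(-(n / length) * log2(n / length) for n in counts.values())

print(shannon_entropy("AAAA"))     # single repeated residue: lowest entropy
print(shannon_entropy("ACDE"))     # all residues distinct: log2(4)
print(shannon_entropy("PEPTIDE"))  # mixed composition: in between
```

Only amino acids that actually occur in the peptide enter the sum, so there is no \( \log_2(0) \) term to handle.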

Configuration

featureGenerator:
  - name: Basic

No parameters are required. The Basic feature uses default preprocessing settings (remove_pre_nxt_aa: true, remove_modification: true).