Spectral Similarity

The SpectralSimilarity feature compares experimentally observed MS2 spectra against theoretical spectra predicted by a deep learning model served via Koina. Koina is an open-source, web-accessible model repository for proteomics that runs prediction models remotely over standard HTTP/S requests, so no specialized hardware is needed on the client side (Lautenbacher et al., Nature Communications, 2025). Agreement between observed and predicted fragmentation patterns is a strong indicator of a correct peptide-spectrum match (PSM).

Source name: SpectralSimilarity

Output Columns

| Column | Description |
| --- | --- |
| spectral_angle_similarity | Normalized spectral angle between experimental and predicted spectra |
| cosine_similarity | Cosine similarity between intensity vectors |
| pearson_correlation | Pearson correlation coefficient |
| spearman_correlation | Spearman rank correlation coefficient |
| mean_squared_error | Mean squared error on L2-normalized intensity vectors |
| unweighted_entropy_similarity | Entropy-based similarity measure |
| predicted_seen_nonzero | Count of predicted peaks that were observed in the experimental spectrum |
| predicted_not_seen | Count of predicted peaks that were not observed |

Computation

Step 1: Spectrum Extraction

Experimental MS2 spectra are extracted from mzML files. Each PSM is linked to its spectrum by matching the PSM's spectrum ID against the spectrumIdPattern regular expression. The m/z and intensity arrays of each spectrum are then sorted by m/z.
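The linking and sorting described above can be sketched in a few lines of Python. This is only an illustration: the pattern is copied from the Configuration example further down, while the helper names `mzml_stem` and `sort_by_mz`, the example spectrum ID, and the reading of the first capture group as the mzML file name are assumptions rather than the feature's actual code.

```python
import re

# Pattern copied from the Configuration example; everything else is hypothetical.
SPECTRUM_ID_PATTERN = re.compile(r"(.+?)\.\d+\.\d+\.\d+")


def mzml_stem(spectrum_id):
    """Return the first capture group (assumed to name the mzML file), or None."""
    match = SPECTRUM_ID_PATTERN.match(spectrum_id)
    return match.group(1) if match else None


def sort_by_mz(mz, intensity):
    """Sort both arrays by m/z while keeping each intensity with its m/z."""
    order = sorted(range(len(mz)), key=lambda i: mz[i])
    return [mz[i] for i in order], [intensity[i] for i in order]


# Hypothetical spectrum ID in the "<file>.<scan>.<scan>.<charge>" style.
print(mzml_stem("sample01.1234.1234.2"))
print(sort_by_mz([300.0, 100.0, 200.0], [3.0, 1.0, 2.0]))
```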

Step 2: Theoretical Spectrum Prediction

Peptide sequences (with modifications mapped to UNIMOD notation via modificationMap) and charge states are sent to a Koina server, which returns predicted fragment ion m/z values and intensities using the specified deep learning model (e.g., AlphaPeptDeep_ms2_generic).

Step 3: Peak Alignment

For each predicted peak at m/z value \( m \), a tolerance window is computed in parts-per-million (ppm):

\[ m_\mathrm{low} = m \cdot \left(1 - \frac{\tau}{10^6}\right), \quad m_\mathrm{high} = m \cdot \left(1 + \frac{\tau}{10^6}\right) \]

where \( \tau \) is the ppm tolerance (default 20). Within this window, the experimental peak with the highest intensity is selected as the match. If no experimental peak falls within the window, the aligned experimental intensity is set to 0.

This produces an aligned experimental intensity vector \( \mathbf{e} \) of the same length as the predicted intensity vector \( \mathbf{p} \).
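A minimal Python sketch of this alignment step follows. The function name `align_peaks` is hypothetical, the window bounds are treated as inclusive (the text does not say), and the experimental m/z array is assumed to be sorted as described in Step 1.

```python
from bisect import bisect_left, bisect_right


def align_peaks(pred_mz, exp_mz, exp_intensity, tol_ppm=20.0):
    """Align experimental intensities to predicted peaks within a ppm window.

    exp_mz must be sorted in increasing order. Window bounds are treated as
    inclusive, which is an assumption of this sketch.
    """
    aligned = []
    for m in pred_mz:
        low = m * (1.0 - tol_ppm / 1e6)
        high = m * (1.0 + tol_ppm / 1e6)
        # Binary search for the slice of experimental peaks inside the window.
        start = bisect_left(exp_mz, low)
        stop = bisect_right(exp_mz, high)
        window = exp_intensity[start:stop]
        # Most intense peak in the window, or 0 when the window is empty.
        aligned.append(max(window) if window else 0.0)
    return aligned


exp_mz = [100.0, 500.004, 500.008, 900.0]
exp_intensity = [5.0, 10.0, 30.0, 7.0]
print(align_peaks([500.0, 700.0, 900.0], exp_mz, exp_intensity))
# → [30.0, 0.0, 7.0]
```

At 20 ppm the window around m/z 500 is about ±0.01 wide, so both nearby experimental peaks qualify and the more intense one (30.0) is kept.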

Step 4: Top-N Peak Selection

Only the top \( N \) peaks by predicted intensity are retained for similarity computation (default \( N = 36 \)). This focuses the comparison on the most informative fragment ions.
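As an illustration, the selection could look like the sketch below. The name `select_top_n`, the tie-breaking rule (earlier peaks win), and the choice to keep the selected peaks in their original order are assumptions, not details confirmed by the text above.

```python
def select_top_n(pred_intensity, aligned_exp_intensity, n=36):
    """Keep the n peaks with the highest predicted intensity in both vectors."""
    # Stable sort: among equal predicted intensities, earlier peaks come first.
    order = sorted(
        range(len(pred_intensity)),
        key=lambda i: pred_intensity[i],
        reverse=True,
    )
    keep = sorted(order[:n])  # restore the original peak order
    p = [pred_intensity[i] for i in keep]
    e = [aligned_exp_intensity[i] for i in keep]
    return p, e


p, e = select_top_n([0.1, 0.9, 0.5, 0.3], [1.0, 0.0, 4.0, 2.0], n=2)
print(p, e)
# → [0.9, 0.5] [0.0, 4.0]
```

Note that the ranking uses the predicted intensities only, so a predicted peak with no experimental match (aligned intensity 0) can still be selected and will lower the similarity scores.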

Step 5: Similarity Metrics

All metrics are computed on the aligned, top-N intensity vectors \( \mathbf{e} \) (experimental) and \( \mathbf{p} \) (predicted).

Spectral Angle Similarity

The spectral angle similarity \( S_\mathrm{SA} \) is derived from the cosine of the angle \( \theta \) between the two vectors:

\[ \cos\theta = \frac{\mathbf{e} \cdot \mathbf{p}}{\|\mathbf{e}\| \, \|\mathbf{p}\|} \]
\[ S_\mathrm{SA} = 1 - \frac{2 \arccos(\cos\theta)}{\pi} \]

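Because intensities are non-negative, \( \theta \) lies in \([0, \pi/2]\), and the \( 2/\pi \) factor rescales it to \([0, 1]\). \( S_\mathrm{SA} \) therefore ranges from 0 (orthogonal spectra with no shared peaks) to 1 (intensity vectors that are identical up to a scale factor).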

Cosine Similarity

The cosine similarity \( S_{\mathrm{cos}} \) is:

\[ S_{\mathrm{cos}} = \frac{\mathbf{e} \cdot \mathbf{p}}{\|\mathbf{e}\| \, \|\mathbf{p}\|} \]
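The cosine similarity and the spectral angle similarity share the same core computation, as the following sketch shows. The function names are hypothetical, and two choices are assumptions of this sketch: returning 0 for an all-zero vector, and clamping the cosine to \([-1, 1]\) to protect `acos` from floating-point overshoot.

```python
import math


def cosine_similarity(e, p):
    """Cosine of the angle between two intensity vectors."""
    dot = sum(x * y for x, y in zip(e, p))
    norm = math.sqrt(sum(x * x for x in e)) * math.sqrt(sum(y * y for y in p))
    if norm == 0.0:
        return 0.0  # assumption: treat an all-zero vector as no similarity
    # Clamp so rounding error cannot push the value outside acos's domain.
    return max(-1.0, min(1.0, dot / norm))


def spectral_angle_similarity(e, p):
    """1 - 2*theta/pi, where theta is the angle between e and p."""
    return 1.0 - 2.0 * math.acos(cosine_similarity(e, p)) / math.pi


print(spectral_angle_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(spectral_angle_similarity([1.0, 0.0], [1.0, 1.0]))  # 45 degree angle
```

The second call illustrates how the two scores differ: a 45 degree angle gives a cosine of about 0.707 but a spectral angle similarity of 0.5, so the spectral angle penalizes moderate mismatches more strongly.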

Pearson Correlation

The Pearson correlation coefficient \( r \) is:

\[ r = \frac{\sum_{i} (e_i - \bar{e})(p_i - \bar{p})}{\sqrt{\sum_{i} (e_i - \bar{e})^2} \sqrt{\sum_{i} (p_i - \bar{p})^2}} \]

where \( \bar{e} \) and \( \bar{p} \) are the means of the experimental and predicted vectors.

Spearman Correlation

The Spearman correlation \( \rho \) is the Pearson correlation computed on the ranks of the values (using average ranking for ties) rather than the values themselves.
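Both correlation coefficients can be sketched with the standard library alone. The function names are hypothetical, and returning NaN for a constant vector (where the correlation is undefined) is an assumption of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")  # assumption: undefined for constant vectors
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


print(average_ranks([10.0, 20.0, 10.0, 30.0]))
# → [1.5, 3.0, 1.5, 4.0]
```

Ties are common here because every unmatched predicted peak has an aligned experimental intensity of exactly 0, which is why the tie-handling rule matters in practice.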

Mean Squared Error

The MSE is computed on L2-normalized vectors. Let \( \hat{e}_i = e_i / \|\mathbf{e}\| \) and \( \hat{p}_i = p_i / \|\mathbf{p}\| \). Then:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{e}_i - \hat{p}_i)^2 \]
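A short sketch of this computation follows. The names `l2_normalize` and `normalized_mse` are hypothetical, and leaving an all-zero vector unchanged instead of dividing by zero is an assumption of this sketch.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm; an all-zero vector is left as-is."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)


def normalized_mse(e, p):
    """Mean squared error between the L2-normalized vectors."""
    e_hat, p_hat = l2_normalize(e), l2_normalize(p)
    return sum((a - b) ** 2 for a, b in zip(e_hat, p_hat)) / len(e_hat)


print(normalized_mse([1.0, 0.0], [0.0, 1.0]))
# → 1.0
```

For two non-zero vectors the squared distance between unit vectors equals \( 2 - 2\cos\theta \), so this metric is \( (2 - 2\cos\theta)/n \): it carries the same information as the cosine similarity, scaled by the number of compared peaks.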

Unweighted Entropy Similarity

The unweighted entropy similarity is based on the concept of spectral entropy introduced by Li et al. (Nature Methods, 2021).

First, normalize intensities to probability distributions \( \mathbf{q}^{(e)} \) and \( \mathbf{q}^{(p)} \):

\[ q_i^{(e)} = \frac{e_i}{\sum_j e_j}, \quad q_i^{(p)} = \frac{p_i}{\sum_j p_j} \]

Compute the Shannon entropy of each distribution and their mixture \( \mathbf{m} \):

\[ S_e = -\sum_i q_i^{(e)} \ln q_i^{(e)}, \quad S_p = -\sum_i q_i^{(p)} \ln q_i^{(p)} \]
\[ m_i = \frac{1}{2}\left(q_i^{(e)} + q_i^{(p)}\right), \quad S_m = -\sum_i m_i \ln m_i \]

The entropy similarity \( S_\mathrm{ent} \) is:

\[ S_\mathrm{ent} = 1 - \frac{2 S_m - S_e - S_p}{\ln 4} \]

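The numerator \( 2 S_m - S_e - S_p \) equals twice the Jensen-Shannon divergence between the two distributions. With natural logarithms that divergence is at most \( \ln 2 \), so dividing by \( \ln 4 = 2 \ln 2 \) normalizes the result to \([0, 1]\).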
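The formulas above translate directly into the sketch below. The function names are hypothetical. Two details are assumptions of this sketch: zero-probability terms are skipped (the usual convention \( 0 \ln 0 = 0 \), needed because unmatched peaks have aligned intensity 0), and a vector that sums to zero yields a similarity of 0.

```python
import math


def shannon_entropy(q):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * math.log(x) for x in q if x > 0.0)


def unweighted_entropy_similarity(e, p):
    """1 - (2*S_m - S_e - S_p) / ln 4 for two non-negative intensity vectors."""
    total_e, total_p = sum(e), sum(p)
    if total_e == 0.0 or total_p == 0.0:
        return 0.0  # assumption: no similarity when one spectrum is empty
    q_e = [x / total_e for x in e]
    q_p = [x / total_p for x in p]
    mixture = [(a + b) / 2.0 for a, b in zip(q_e, q_p)]
    numerator = (
        2.0 * shannon_entropy(mixture)
        - shannon_entropy(q_e)
        - shannon_entropy(q_p)
    )
    return 1.0 - numerator / math.log(4.0)


print(unweighted_entropy_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same shape
print(unweighted_entropy_similarity([1.0, 0.0], [0.0, 1.0]))  # no shared peaks
```

The two calls exercise the extremes: proportional vectors give a similarity of about 1, and vectors with no shared peaks give about 0.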

Peak Matching Counts

  • predicted_seen_nonzero — the number of predicted peaks (with intensity > 0) that have a matched experimental peak with intensity > 0.
  • predicted_not_seen — the number of predicted peaks (with intensity > 0) that have no matching experimental peak (aligned intensity = 0).
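Both counts follow from a single pass over the predicted and aligned experimental intensities, as in this minimal sketch. The function name `peak_match_counts` is hypothetical.

```python
def peak_match_counts(pred_intensity, aligned_exp_intensity):
    """Return (predicted_seen_nonzero, predicted_not_seen)."""
    seen = 0
    not_seen = 0
    for p, e in zip(pred_intensity, aligned_exp_intensity):
        if p <= 0:
            continue  # only predicted peaks with intensity > 0 are counted
        if e > 0:
            seen += 1
        else:
            not_seen += 1
    return seen, not_seen


print(peak_match_counts([0.5, 0.2, 0.0, 0.1], [10.0, 0.0, 3.0, 0.0]))
# → (1, 2)
```

In the example, the third peak has a predicted intensity of 0 and is ignored, even though an experimental peak was aligned to it.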

Configuration

featureGenerator:
  - name: SpectralSimilarity
    params:
      mzmlDir: ../data               # Directory containing mzML files
      spectrumIdPattern: (.+?)\.\d+\.\d+\.\d+  # Regex to link PSMs to mzML files
      model: AlphaPeptDeep_ms2_generic          # Koina prediction model
      collisionEnergy: 28            # Collision energy for prediction
      instrument: LUMOS              # Instrument type for prediction
      tolerance: 20                  # Peak matching tolerance in ppm
      numTopPeaks: 36                # Number of top peaks to compare
      url: koina.wilhelmlab.org:443  # Koina server gRPC endpoint
      ssl: true                      # Use SSL for gRPC connection

Note

For a local Koina/Triton server, set url: 127.0.0.1:8500 and ssl: false. The default public endpoint is koina.wilhelmlab.org:443 with SSL enabled.

Requirements

This feature requires access to a Koina server — either the public endpoint or a self-hosted Triton Inference Server.