Evaluation of the accuracy of false alarm frequency control methods for de novo spectrum
https://doi.org/10.21869/2223-1536-2025-15-3-122-141
Abstract
The purpose of the research is to compare machine learning-based (deep learning) approaches with classical methods for mass spectrum annotation under big data conditions, and to identify the optimal scenario for their integration.
Methods. The study is based on the PXD004452 dataset, which contains 2.5 million unique peptides. An interaction scheme based on Python/TensorFlow/PyTorch was developed to process peptide spectra in parallel on a GPU cluster. The following steps were used: filtering of the top 150 peaks by intensity; generation of theoretical b-/y-ions, taking modifications into account; and peptide prediction (PepNet: a convolutional + recurrent network; Tide-search: an index-shifting strategy). Metrics: number of matches, delta mass, Levenshtein distance, ROC curves, and error distribution.
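The first two preprocessing steps named in the Methods can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the `mods` format (0-based residue index mapped to a mass shift), and the restriction to singly charged ions are assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # monoisotopic mass of H2O, Da


def top_peaks(mz, intensity, n=150):
    """Keep the n most intense peaks, returned in ascending m/z order."""
    peaks = sorted(zip(mz, intensity), key=lambda p: p[1], reverse=True)[:n]
    return sorted(peaks)


def by_ions(peptide, mods=None):
    """Return singly charged b- and y-ion m/z lists for a peptide.

    mods is a hypothetical format: {0-based residue index: mass shift in Da}.
    """
    mods = mods or {}
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0)
              for i, aa in enumerate(peptide)]

    b_ions, prefix = [], 0.0
    for m in masses[:-1]:            # b1 .. b(n-1): N-terminal fragments
        prefix += m
        b_ions.append(prefix + PROTON)

    y_ions, suffix = [], 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal fragments
        suffix += m
        y_ions.append(suffix + WATER + PROTON)

    return b_ions, y_ions
```

For example, `by_ions("MA", {0: 15.994915})` models an oxidized methionine at the first position; the shift propagates to every b-ion that contains that residue.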
Results. PepNet requires significant computational resources, and its prediction quality is inferior to that of Tide-search, especially for long peptides and peptides with modifications (average match: ~4.2 pi vs 9.7; p < 0.001). However, PepNet performs better on spectra for which the database search contains no relevant sequence, demonstrating an important ability to identify novel peptides. Levenshtein distance distribution: ~30% are complete matches (0); ~52% are small deviations (1-5); the remainder are significant discrepancies (>5).
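The Levenshtein distance bins reported above (0, 1-5, >5) can be computed as in the following Python sketch. Which sequence pairs were compared is not stated in the abstract; the sketch assumes pairs of (predicted sequence, reference sequence), and all function names are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bin_distance(d):
    """Map a distance to the three bins used in the Results."""
    if d == 0:
        return "complete match (0)"
    if d <= 5:
        return "small deviation (1-5)"
    return "significant discrepancy (>5)"


def distance_distribution(pairs):
    """Fraction of (predicted, reference) pairs falling into each bin."""
    counts = Counter(bin_distance(levenshtein(p, r)) for p, r in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Calling `distance_distribution` on a list of sequence pairs returns the share of each bin, which is the form in which the distribution is reported in the Results.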
Conclusions. The deep learning method (PepNet) shows promise, but without integration with database search it is inferior in accuracy. A hybrid architecture is proposed: pep-tagging via PepNet, followed by refinement and verification via database search. Such a big data pipeline would combine the discovery of new peptides (de novo) with high identification reliability (database search).
About the Author
M. M. Tevyashov
Russian Federation
Mikhail M. Teviashov, Research Assistant
30-32 Griboedov canal Emb., St. Petersburg 191023
For citations:
Tevyashov M.M. Evaluation of the accuracy of false alarm frequency control methods for de novo spectrum. Proceedings of the Southwest State University. Series: IT Management, Computer Science, Computer Engineering. Medical Equipment Engineering. 2025;15(3):122-141. (In Russ.) https://doi.org/10.21869/2223-1536-2025-15-3-122-141


