SWPepNovo: An Efficient De Novo Peptide Sequencing Tool for Large-scale MS/MS Spectra Analysis

Tandem mass spectrometry (MS/MS)-based de novo peptide sequencing is a powerful method for high-throughput protein analysis. However, the explosive growth of MS/MS spectra datasets inevitably and exponentially raises the computational demand of existing de novo peptide sequencing methods, an issue that urgently needs to be solved in computational biology. This paper introduces SWPepNovo, an efficient tool built on the SW26010 many-core processor, which processes large-scale peptide MS/MS spectra using a parallel peptide-spectrum matches (PSMs) algorithm. Our design employs a two-level parallelization mechanism: (1) task-level parallelism between MPEs using MPI, based on a data transformation method and a dynamic feedback task scheduling algorithm, and (2) thread-level parallelism across CPEs using asynchronous task transfer and multithreading. Moreover, three optimization strategies (vectorization, double buffering, and memory access optimization) are employed to overcome both the compute-bound and the memory-bound bottlenecks in the parallel PSMs algorithm. Experiments on multiple spectra datasets compare the performance of SWPepNovo against three state-of-the-art peptide sequencing tools: PepNovo+, PEAKS, and DeepNovo-DIA. SWPepNovo also shows high scalability in experiments on extremely large datasets of up to 11.22 GB. The software and the parameter settings are available at https://github.com/ChuangLi99/SWPepNovo.
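The abstract does not spell out the dynamic feedback task scheduling algorithm, so the following is only a minimal sketch of the general idea, assuming that each worker reports a measured throughput (spectra per second) and that the next batch is divided in proportion to those measurements. The function name `feedback_chunks` and its parameters are hypothetical and are not taken from the SWPepNovo code base.

```python
from typing import List


def feedback_chunks(throughputs: List[float], batch: int) -> List[int]:
    """Split `batch` spectra across workers in proportion to measured throughput.

    Hypothetical illustration only: the proportional rule is an assumption,
    not the scheduling algorithm described in the paper.
    """
    if batch < 0:
        raise ValueError("batch must be non-negative")
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be non-empty and positive")

    total = sum(throughputs)
    # Start from the floor of each worker's proportional share.
    sizes = [int(batch * t / total) for t in throughputs]

    # Hand the spectra lost to rounding to the fastest workers first.
    leftover = batch - sum(sizes)
    fastest_first = sorted(range(len(throughputs)), key=lambda i: -throughputs[i])
    for i in fastest_first[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    # Three workers; the third was measured to be twice as fast as the others.
    print(feedback_chunks([1.0, 1.0, 2.0], 100))
```

In such a scheme, the throughputs would be re-measured after every batch, so a worker that slows down receives less work in the next round.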
