A Fast and Memory‐Efficient Spectral Library Search Algorithm Using Locality‐Sensitive Hashing

As MS/MS spectra accumulate in spectral libraries, spectral library searching has emerged as an important approach for peptide identification in proteomics. It complements the commonly used protein database searching approach, particularly in proteomic analyses of well‐studied model organisms such as human. Existing spectral library searching algorithms compare a query MS/MS spectrum with every library spectrum that has a matching precursor mass and charge state, which can become computationally intensive as libraries grow rapidly. This article presents msSLASH, software that implements a fast spectral library searching algorithm based on the Locality‐Sensitive Hashing (LSH) technique. The algorithm first converts the library and query spectra into bit‐strings using LSH functions, and then computes similarity only between spectra whose bit‐strings are highly similar. In spectral library searches of large real‐world MS/MS datasets, the algorithm significantly reduced the number of spectral comparisons and, as a result, achieved a 2–9× speedup over the existing spectral library searching algorithm SpectraST. msSLASH is implemented in C/C++ and is ready to be used in proteomic data analyses.
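The two-step idea described above (hash spectra into bit-strings, then score only spectra whose bit-strings agree) can be illustrated with a minimal Python sketch. This is not the msSLASH implementation, which is written in C/C++. The abstract does not say which LSH family is used, so the sketch assumes random-hyperplane hashing for cosine similarity, a single hash table, exact bit-string matches as the candidate criterion, and spectra already binned into fixed-length intensity vectors. All function names and the toy library are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """Pack one bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def cosine(a, b):
    """Cosine similarity between two binned spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(library, hyperplanes):
    """Group library spectrum names by their bit-string signature."""
    index = defaultdict(list)
    for name, vec in library.items():
        index[signature(vec, hyperplanes)].append(name)
    return index


def search(query, library, index, hyperplanes):
    """Score only library spectra sharing the query's signature; return the best hit."""
    candidates = index.get(signature(query, hyperplanes), [])
    scored = [(cosine(query, library[name]), name) for name in candidates]
    return max(scored) if scored else None


# Toy library of binned spectra (hypothetical data for illustration only).
library = {
    "pepA": [0.0, 5.0, 1.0, 0.0],
    "pepB": [3.0, 0.0, 0.0, 2.0],
}
planes = make_hyperplanes(num_bits=8, dim=4, seed=1)
index = build_index(library, planes)
best = search([0.0, 10.0, 2.0, 0.0], library, index, planes)
```

In this sketch the saving comes from `search` calling `cosine` only on the spectra in one bucket rather than on the whole library. A practical search would typically also restrict candidates by precursor mass and charge state, and would use several hash tables so that similar spectra separated by a single hyperplane are not missed.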
