msCRUSH: fast tandem mass spectra clustering using locality sensitive hashing

Large-scale proteomics projects often generate massive and highly redundant sets of tandem mass (MS/MS) spectra. Spectra clustering algorithms can reduce the redundancy in these datasets and thus speed up database searching for peptide identification, a major bottleneck in proteomic data analysis. Furthermore, the consensus spectra derived from highly similar MS/MS spectra in the same cluster may enhance the signal peaks while reducing the noise peaks, and thus improve the sensitivity of peptide identification. In this paper, we present the software msCRUSH, which implements a novel spectra clustering algorithm based on the locality sensitive hashing (LSH) technique. When tested on a large-scale proteomic dataset consisting of 18.4 million spectra (including 11.5 million spectra of charge 2+), msCRUSH runs 7.6-12.1x faster than the state-of-the-art spectra clustering software, PRIDE Cluster, while achieving higher clustering sensitivity and comparable accuracy. Using the consensus spectra reported by msCRUSH, the commonly used spectra search engines MSGF+ and Mascot identify 5% and 4% more unique peptides, respectively, than they do from the raw MS/MS spectra at the same peptide-level false discovery rate (1% FDR). msCRUSH is implemented in C++ and is released as open source software.
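To make the LSH idea concrete, the following is a minimal sketch of one common hash family for cosine similarity, random hyperplane hashing. It is not taken from the msCRUSH source: the abstract does not state which hash family is used, and the binned-vector representation and the names `BinnedSpectrum`, `makeHyperplanes`, `lshKey`, and `bucketize` are all assumptions made for illustration. Spectra whose vectors point in similar directions tend to fall in the same bucket, so pairwise comparisons can be restricted to bucket members rather than the whole dataset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Hypothetical representation: a spectrum binned into a fixed-length
// vector of peak intensities (one entry per m/z bin).
using BinnedSpectrum = std::vector<double>;
using Hyperplanes = std::vector<std::vector<double>>;

// Draws numBits random hyperplanes in dim dimensions with Gaussian components.
Hyperplanes makeHyperplanes(std::size_t numBits, std::size_t dim, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    Hyperplanes planes(numBits, std::vector<double>(dim));
    for (auto& plane : planes)
        for (auto& component : plane) component = gauss(rng);
    return planes;
}

// Builds the hash key: bit b is set when the spectrum lies on the
// non-negative side of hyperplane b. Positive rescaling of the
// intensities does not change the key.
std::uint64_t lshKey(const BinnedSpectrum& spectrum, const Hyperplanes& planes) {
    assert(planes.size() <= 64);  // the key must fit in 64 bits
    std::uint64_t key = 0;
    for (std::size_t b = 0; b < planes.size(); ++b) {
        assert(planes[b].size() == spectrum.size());
        double dot = 0.0;
        for (std::size_t i = 0; i < spectrum.size(); ++i)
            dot += planes[b][i] * spectrum[i];
        if (dot >= 0.0) key |= (std::uint64_t{1} << b);
    }
    return key;
}

// Groups spectrum indices by hash key. Spectra sharing a bucket are
// candidates for the same cluster and would still need a similarity check.
std::unordered_map<std::uint64_t, std::vector<std::size_t>>
bucketize(const std::vector<BinnedSpectrum>& spectra, const Hyperplanes& planes) {
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < spectra.size(); ++i)
        buckets[lshKey(spectra[i], planes)].push_back(i);
    return buckets;
}
```

In a sketch like this, more bits per key make buckets smaller and purer, while repeating the hashing with fresh hyperplanes gives similar spectra that were separated by chance another opportunity to collide. How msCRUSH balances these choices is not described in the abstract.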
