Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets

Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average, 75% of spectra analyzed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. The Proteomics Identifications (PRIDE) Database Archive is one of the largest MS proteomics public data repositories worldwide. By clustering all tandem MS spectra publicly available in the PRIDE Archive, coming from hundreds of data sets, we were able to consistently characterize spectra into three distinct groups: (1) incorrectly identified, (2) correctly identified but below the set scoring threshold, and (3) truly unidentified. Using multiple complementary analysis approaches, we were able to identify ∼20% of the consistently unidentified spectra. The complete spectrum-clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.
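As a rough illustration of how cluster membership can separate these groups, the sketch below labels each spectrum in one cluster by comparing its submitted identification with the cluster's most common peptide. It is a minimal sketch under stated assumptions, not the PRIDE Cluster algorithm: the `Spectrum` record, the `categorize_cluster` function, the label strings, and the 0.7 agreement ratio are all hypothetical choices made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical label strings; the abstract names the groups but not these labels.
CONSISTENT = "consistent with cluster"
POSSIBLY_INCORRECT = "possibly incorrectly identified"
POSSIBLY_IDENTIFIABLE = "unidentified, possibly identifiable"
CONSISTENTLY_UNIDENTIFIED = "consistently unidentified"
AMBIGUOUS = "ambiguous cluster"


@dataclass
class Spectrum:
    spectrum_id: str
    peptide: Optional[str]  # submitted identification, None if unidentified


def categorize_cluster(members: List[Spectrum], min_ratio: float = 0.7) -> Dict[str, str]:
    """Label every spectrum in one cluster relative to the dominant peptide.

    min_ratio is an assumed cutoff: the fraction of identified members that
    must agree before the cluster's dominant peptide is trusted.
    """
    identified = [s.peptide for s in members if s.peptide is not None]

    # No member was ever identified: the whole cluster is unidentified.
    if not identified:
        return {s.spectrum_id: CONSISTENTLY_UNIDENTIFIED for s in members}

    top_peptide, top_count = Counter(identified).most_common(1)[0]

    # Identifications disagree too much to trust any single peptide.
    if top_count / len(identified) < min_ratio:
        return {s.spectrum_id: AMBIGUOUS for s in members}

    labels = {}
    for s in members:
        if s.peptide is None:
            # Clusters with a reliably identified peptide, e.g. a match
            # that fell below the original scoring threshold.
            labels[s.spectrum_id] = POSSIBLY_IDENTIFIABLE
        elif s.peptide == top_peptide:
            labels[s.spectrum_id] = CONSISTENT
        else:
            labels[s.spectrum_id] = POSSIBLY_INCORRECT
    return labels


if __name__ == "__main__":
    cluster = [
        Spectrum("s1", "PEPTIDEK"),
        Spectrum("s2", "PEPTIDEK"),
        Spectrum("s3", "PEPTIDEK"),
        Spectrum("s4", "OTHERK"),
        Spectrum("s5", None),
    ]
    for spectrum_id, label in categorize_cluster(cluster).items():
        print(spectrum_id, label)
```

In this toy setting, a spectrum whose identification disagrees with a reliable cluster maps to group (1), an unidentified spectrum in a reliably identified cluster is a candidate for group (2), and a cluster with no identified members corresponds to group (3).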
