An efficient algorithm for clustering of large-scale mass spectrometry data

High-throughput mass spectrometers can produce data sets containing thousands of spectra for a single biological sample. These data sets contain substantial redundancy, because the same peptide may be selected multiple times in an LC-MS/MS experiment. In this paper, we present CAMS (Clustering Algorithm for Mass Spectra), an efficient algorithm for clustering mass spectrometry data that increases both the sensitivity and the confidence of spectral assignment. CAMS uses a novel metric, called F-set, that allows accurate identification of similar spectra. A graph-theoretic framework is defined that allows the F-set metric to be used efficiently for accurate cluster identification. The accuracy of the algorithm is tested on real HCD and CID data sets containing varying numbers of peptides. Our experiments show that the proposed algorithm clusters spectra with very high accuracy in a reasonable amount of time for large spectral data sets. The algorithm thus decreases computational time by compressing the data sets, while increasing the throughput of the data by enabling interpretation of low signal-to-noise (S/N) spectra.
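The general idea of graph-based spectral clustering can be illustrated with a minimal sketch. This is not the CAMS implementation, and the F-set metric is not defined in this abstract; the function `shared_peak_fraction`, the tolerance `tol`, and the cutoff `threshold` are hypothetical stand-ins chosen only for illustration. Each spectrum is assumed to be a sorted list of peak m/z values. Spectra are treated as graph vertices, an edge joins two spectra whose similarity reaches the cutoff, and each connected component is reported as one cluster.

```python
from collections import defaultdict


def shared_peak_fraction(a, b, tol=0.5):
    """Stand-in similarity (NOT the F-set metric): fraction of peaks in the
    smaller spectrum that match a peak in the other within `tol` m/z units.
    Both inputs are assumed to be sorted lists of m/z values."""
    if not a or not b:
        return 0.0
    i = j = matched = 0
    # Two-pointer sweep over the sorted peak lists.
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tol:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    return matched / min(len(a), len(b))


def cluster_spectra(spectra, threshold=0.7, tol=0.5):
    """Group spectra into connected components of the similarity graph.
    Returns a sorted list of clusters, each a list of spectrum indices."""
    n = len(spectra)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Naive all-pairs comparison: quadratic in the number of spectra,
    # used here only to keep the illustration short.
    for i in range(n):
        for j in range(i + 1, n):
            if shared_peak_fraction(spectra[i], spectra[j], tol) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    example = [
        [100.0, 200.0, 300.0],
        [100.1, 200.2, 299.9],
        [150.0, 250.0, 350.0],
        [150.2, 250.1, 350.3, 400.0],
    ]
    print(cluster_spectra(example))
```

The all-pairs loop in this sketch does not reflect the efficiency claimed for CAMS; a practical implementation would typically restrict comparisons, for example to spectra with similar precursor mass, which is also an assumption here rather than a detail taken from the paper.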
