CAMS-RS: Clustering Algorithm for Large-Scale Mass Spectrometry Data Using Restricted Search Space and Intelligent Random Sampling

High-throughput mass spectrometers can produce massive amounts of redundant data at an astonishing rate, and many of the resulting spectra have a poor signal-to-noise (S/N) ratio. These low-S/N spectra may not be interpretable using conventional spectra-to-database matching techniques. In this paper, we present an efficient algorithm, CAMS-RS (Clustering Algorithm for Mass Spectra using Restricted Space and Sampling), for clustering raw mass spectrometry data. CAMS-RS uses a novel metric, called F-set, that exploits temporal and spatial patterns to accurately assess the similarity between two given spectra. The F-set similarity metric is independent of retention time and allows clustering of mass spectrometry data from independent LC-MS/MS runs. A novel restricted search space strategy is devised to limit the number of spectrum-to-spectrum comparisons. An intelligent sampling method is executed on individual bins, and the per-bin results are then merged to form the final clusters. Our experiments, using experimentally generated data sets, show that the proposed algorithm clusters spectra with high accuracy and helps in interpreting low-S/N spectra. The CAMS-RS algorithm is highly scalable with an increasing number of spectra, and our implementation allows clustering of up to a million spectra within minutes.
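
The general idea of restricting comparisons to bins and sampling within each bin can be illustrated with a minimal sketch. This is not the CAMS-RS implementation: the abstract does not define the F-set metric, the binning key, or the sampling rule, so the sketch assumes binning by precursor m/z, substitutes plain cosine similarity for F-set, and compares each spectrum against a small random sample of each existing cluster. The function names and the parameters `bin_width`, `sample_size`, and `threshold` are hypothetical.

```python
from collections import defaultdict
import math
import random


def cosine(a, b):
    """Cosine similarity between two sparse peak dicts {mz_bin: intensity}.

    Stand-in for the F-set metric, which the abstract does not define.
    """
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_restricted(spectra, bin_width=1.0, sample_size=2,
                       threshold=0.8, seed=0):
    """Cluster spectra, comparing only within precursor m/z bins.

    spectra: list of (precursor_mz, peaks) tuples.
    Returns a list of clusters, each a list of indices into `spectra`.
    """
    rng = random.Random(seed)

    # Restricted search space (assumed form): spectra are compared only
    # with spectra that fall into the same precursor m/z bin.
    bins = defaultdict(list)
    for idx, (precursor_mz, _) in enumerate(spectra):
        bins[int(precursor_mz // bin_width)].append(idx)

    clusters = []
    for key in sorted(bins):
        local = []  # clusters found inside this bin
        for idx in bins[key]:
            peaks = spectra[idx][1]
            for cluster in local:
                # Sampling (assumed form): test against a few randomly
                # chosen members instead of every member of the cluster.
                probes = rng.sample(cluster, min(sample_size, len(cluster)))
                if any(cosine(peaks, spectra[p][1]) >= threshold
                       for p in probes):
                    cluster.append(idx)
                    break
            else:
                local.append([idx])
        # Merge step: per-bin results are concatenated into the final list.
        clusters.extend(local)
    return clusters


if __name__ == "__main__":
    demo = [
        (500.2, {100: 1.0, 200: 0.5}),
        (500.4, {100: 0.9, 200: 0.6}),
        (500.7, {300: 1.0}),
        (742.1, {100: 1.0, 200: 0.5}),
    ]
    print(cluster_restricted(demo))
```

In this sketch the fourth spectrum has the same peaks as the first but is never compared with it, because the two fall into different bins; that is the trade-off a restricted search space makes in exchange for fewer comparisons. A fixed-width binning like this one can also separate near neighbours that straddle a bin boundary, which a real implementation would need to handle.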
