When less can yield more – Computational preprocessing of MS/MS spectra for peptide identification

The effectiveness of database search algorithms, such as Mascot, Sequest, and ProteinPilot, is limited by the quality of the input spectra: spurious peaks in MS/MS spectra can jeopardize the correct identification of peptides or reduce their score significantly. Consequently, efficient preprocessing of MS/MS spectra can increase the sensitivity of peptide identification while reducing file size and run time, without compromising specificity. We investigate the performance of 25 MS/MS preprocessing methods on various data sets and make software for improved preprocessing of mgf/dta files freely available from http://hci.iwr.uni-heidelberg.de/mip/proteomics or http://www.childrenshospital.org/research/steenlab.
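
To make the idea of removing spurious peaks concrete, the following is a minimal sketch of one common style of peak filter: keeping only the k most intense peaks within each m/z window. It is an illustration under stated assumptions and is not claimed to be one of the 25 methods evaluated here. The function name `top_k_per_window`, the window width of 100 m/z units, and the default k = 5 are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

# A peak is represented as (m/z, intensity); this layout is an assumption
# made for the sketch, not a format prescribed by the paper.
Peak = Tuple[float, float]


def top_k_per_window(peaks: List[Peak], window: float = 100.0, k: int = 5) -> List[Peak]:
    """Keep the k most intense peaks in each m/z window of the given width.

    The window width and k are illustrative defaults, not tuned values.
    """
    if window <= 0 or k < 1:
        raise ValueError("window must be positive and k must be at least 1")

    # Group peaks by the index of the m/z window they fall into.
    bins: Dict[int, List[Peak]] = {}
    for mz, intensity in peaks:
        bins.setdefault(int(mz // window), []).append((mz, intensity))

    # Within each window, retain only the k most intense peaks.
    kept: List[Peak] = []
    for window_peaks in bins.values():
        strongest = sorted(window_peaks, key=lambda p: p[1], reverse=True)[:k]
        kept.extend(strongest)

    # Return the surviving peaks in ascending m/z order.
    return sorted(kept)


if __name__ == "__main__":
    spectrum = [(101.0, 5.0), (150.0, 50.0), (199.0, 1.0), (250.0, 10.0)]
    print(top_k_per_window(spectrum, window=100.0, k=2))
```

A filter of this kind shrinks the peak list passed to the search engine, which is the mechanism by which file size and run time can decrease; whether sensitivity improves depends on the data and on the chosen parameters.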
