Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0

AbstractPercolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator’s processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness for the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method—grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein—in the Percolator package. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PM:24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license. Graphical Abstractᅟ

[1]  Michael J MacCoss,et al.  Comparison of database search strategies for high precursor mass accuracy MS/MS data. , 2010, Journal of proteome research.

[2]  Robertson Craig,et al.  TANDEM: matching proteins with tandem mass spectra. , 2004, Bioinformatics.

[3]  William Stafford Noble,et al.  A review of statistical methods for protein identification using tandem mass spectrometry. , 2012, Statistics and its interface.

[4]  D. N. Perkins,et al.  Probability‐based protein identification by searching sequence databases using mass spectrometry data , 1999, Electrophoresis.

[5]  William Stafford Noble,et al.  Crux: Rapid Open Source Protein Tandem Mass Spectrometry Analysis , 2014, Journal of proteome research.

[6]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[7]  Lennart Martens,et al.  A decoy-free approach to the identification of peptides. , 2015, Journal of proteome research.

[8]  Lukas Käll,et al.  Mass fingerprinting of complex mixtures: protein inference from high-resolution peptide masses and predicted retention times. , 2013, Journal of proteome research.

[9]  William Stafford Noble,et al.  Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. , 2010, Journal of proteome research.

[10]  Alexey I Nesvizhskii,et al.  Interpretation of Shotgun Proteomic Data , 2005, Molecular & Cellular Proteomics.

[11]  P. Pevzner,et al.  False discovery rates of protein identifications: a strike against the two-peptide rule. , 2009, Journal of proteome research.

[12]  M. Mann,et al.  MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification , 2008, Nature Biotechnology.

[13]  William Stafford Noble,et al.  Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics. , 2013, Journal of proteomics.

[14]  M. MacCoss,et al.  High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. , 2007, Analytical chemistry.

[15]  Ruedi Aebersold,et al.  The Need for Guidelines in Publication of Peptide and Protein Identification Data , 2004, Molecular & Cellular Proteomics.

[16]  M. Kendall Statistical Methods for Research Workers , 1937, Nature.

[17]  Yi-Kuo Yu,et al.  Mass spectrometry-based protein identification with accurate statistical significance assignment , 2014, Bioinform..

[18]  Dan Xie,et al.  Variation and Genetic Control of Protein Abundance in Humans , 2013, Nature.

[19]  John D. Venable,et al.  MS1, MS2, and SQT-three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. , 2004, Rapid communications in mass spectrometry : RCM.

[20]  R. Aebersold,et al.  A statistical model for identifying proteins by tandem mass spectrometry. , 2003, Analytical chemistry.

[21]  Robert Burke,et al.  ProteoWizard: open source software for rapid proteomics tools development , 2008, Bioinform..

[22]  Matthew The,et al.  How to talk about protein‐level false discovery rates in shotgun proteomics , 2016, Proteomics.

[23]  Lukas Käll,et al.  Solution to Statistical Challenges in Proteomics Is More Statistics, Not Less. , 2015, Journal of proteome research.

[24]  Mathias Wilhelm,et al.  A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets , 2015, Molecular & Cellular Proteomics.

[25]  Shamil R. Sunyaev,et al.  Assigning spectrum-specific P-values to protein identifications by mass spectrometry , 2011, Bioinform..

[26]  Gary D Bader,et al.  A draft map of the human proteome , 2014, Nature.

[27]  William Stafford Noble,et al.  Semi-supervised learning for peptide identification from shotgun proteomics datasets , 2007, Nature Methods.

[28]  P. Pevzner,et al.  Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. , 2008, Journal of proteome research.