A protein standard that emulates homology for the characterization of protein inference algorithms

A natural way to benchmark an analytical experimental setup is to analyze samples of known composition and assess how accurately their content can be inferred from the data. In shotgun proteomics, an inherent problem in interpreting data is that the measured analytes are peptides, not the proteins themselves. Because some proteins share proteolytic peptides, more than one set of proteins may explain a given set of detected peptides, so mechanisms are needed that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins deliberately selected to produce tryptic peptides that are unique to a single protein. Such samples therefore do not expose any of the complications of protein inference. Hence, a realistic benchmark of protein inference procedures requires samples of known content in which the present proteins share peptides with proteins known to be absent. Here, we present such a standard, based on human protein fragments expressed in E. coli. To illustrate its application, we benchmark a set of protein inference procedures on data from this standard. We observe that inference procedures that exclude shared peptides provide more accurate error estimates than methods that include information from shared peptides, while still giving reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content that lacks proteins with shared tryptic peptides can give a false sense of accuracy for many protein inference methods.
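The ambiguity caused by shared peptides can be illustrated with a small sketch. The toy database, the peptide sequences, and the function names below are hypothetical and chosen only for illustration; they are not taken from the standard or from any of the benchmarked inference procedures.

```python
from itertools import combinations

# Hypothetical toy database: protein name -> set of its tryptic peptides.
# "P_present" and "P_absent_homolog" share two peptides, mimicking homology.
DATABASE = {
    "P_present": {"AAAK", "CCCR", "DDDK"},
    "P_absent_homolog": {"AAAK", "CCCR", "EEEK"},
    "P_unrelated": {"FFFK"},
}


def explaining_sets(observed, database):
    """Return every minimal protein set whose peptides cover all observed peptides."""
    proteins = sorted(database)
    solutions = []
    for size in range(1, len(proteins) + 1):
        for combo in combinations(proteins, size):
            covered = set().union(*(database[p] for p in combo))
            if not observed <= covered:
                continue
            # Keep only minimal sets: skip supersets of a smaller solution.
            if any(set(s) <= set(combo) for s in solutions):
                continue
            solutions.append(combo)
    return solutions


def proteins_with_unique_evidence(observed, database):
    """Return proteins supported by at least one observed peptide found in no other protein."""
    counts = {}
    for peptides in database.values():
        for pep in peptides:
            counts[pep] = counts.get(pep, 0) + 1
    return {
        protein
        for protein, peptides in database.items()
        if any(pep in observed and counts[pep] == 1 for pep in peptides)
    }


# Only shared peptides detected: two different proteins explain the data equally well.
shared_only = {"AAAK", "CCCR"}
ambiguous = explaining_sets(shared_only, DATABASE)

# Adding a unique peptide resolves the ambiguity.
with_unique = {"AAAK", "CCCR", "DDDK"}
resolved = explaining_sets(with_unique, DATABASE)
```

With only the shared peptides detected, both the present protein and its absent homolog explain the data, which is the situation a sample built solely from proteins with unique peptides never produces. Restricting the evidence to unique peptides, as in `proteins_with_unique_evidence`, avoids the ambiguity at the cost of discarding the shared peptides.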
