Integrated identification and quantification error probabilities for shotgun proteomics

Protein quantification by label-free shotgun proteomics experiments is plagued by a multitude of error sources. Typical pipelines for identifying differentially expressed proteins use intermediate filters in an attempt to control the error rate. However, they often ignore certain error sources and, moreover, treat the filtered lists as completely correct in subsequent steps. These two shortcomings can easily lead to a loss of control of the false discovery rate (FDR). We propose a probabilistic graphical model, Triqler, that propagates error information through all steps, employing distributions instead of point estimates, most notably for missing value imputation. The model outputs posterior probabilities for fold changes between treatment groups, highlighting uncertainty rather than hiding it. We analyzed three engineered datasets and achieved FDR control and high sensitivity, even for truly absent proteins. In a bladder cancer clinical dataset we discovered 35 proteins at 5% FDR, whereas the original study discovered 1 protein and MaxQuant/Perseus discovered 4 proteins at this threshold. Compellingly, these 35 proteins showed enrichment for functional annotation terms, whereas the top-ranked proteins reported by MaxQuant/Perseus showed no enrichment. The model executes in minutes and is freely available at https://pypi.org/project/triqler/.
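
To illustrate how posterior probabilities for fold changes can be turned into an FDR-controlled list, the following is a minimal sketch and not Triqler's actual implementation. It assumes that posterior samples of a protein's log2 fold change are already available, and the function names and the threshold of 1.0 are hypothetical. The posterior error probability (PEP) is taken as the posterior mass below the fold change threshold, and the FDR of a ranked list is estimated as the running mean of the sorted PEPs.

```python
import numpy as np


def fold_change_pep(log2_fc_samples, threshold=1.0):
    """PEP for one protein: fraction of posterior samples whose
    absolute log2 fold change falls below the threshold.
    (Hypothetical helper; the threshold is an illustrative choice.)"""
    samples = np.asarray(log2_fc_samples, dtype=float)
    return float(np.mean(np.abs(samples) < threshold))


def fdr_from_peps(peps):
    """Estimated FDR (q-value) per protein: the mean PEP of all proteins
    ranked at least as confidently, returned in the input order."""
    peps = np.asarray(peps, dtype=float)
    order = np.argsort(peps)  # most confident proteins first
    running_mean = np.cumsum(peps[order]) / np.arange(1, len(peps) + 1)
    qvals = np.empty_like(running_mean)
    qvals[order] = running_mean  # map back to the original order
    return qvals


if __name__ == "__main__":
    # Illustrative PEPs only, not results from the paper.
    example_peps = [0.5, 0.01, 0.03]
    print(fdr_from_peps(example_peps))
```

Because the PEPs are sorted in ascending order, their running mean never decreases, so the estimated FDR grows monotonically down the ranked list and can be thresholded directly, for example at 5%.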
