EPIFANY – A method for efficient high-confidence protein inference

Accurate protein inference under the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient, but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data EPIFANY is the only tested method which finds all true-positive proteins at a 5% protein FDR without strict pre-filtering on PSM level, yielding an increase in identification performance (+10% in the number of true positives and +35% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality including shared peptides results in better protein inference results and thus increased robustness of the biological hypotheses generated. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany.

[1]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 , 2000, Nucleic Acids Res..

[2]  Lev I Levitsky,et al.  Unbiased False Discovery Rate Estimation for Shotgun Proteomics Based on the Target-Decoy Approach. , 2017, Journal of proteome research.

[3]  Anthony J. Cesnik,et al.  Proteomics in non-human primates: utilizing RNA-Seq data to improve protein identification by mass spectrometry in vervet monkeys , 2017, BMC Genomics.

[4]  Quanhu Sheng,et al.  A Bayesian Approach to Protein Inference Problem in Shotgun Proteomics , 2008, RECOMB.

[5]  Natalie I. Tasman,et al.  A Cross-platform Toolkit for Mass Spectrometry and Proteomics , 2012, Nature Biotechnology.

[6]  P. Pevzner,et al.  Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. , 2008, Journal of proteome research.

[7]  Mathias Wilhelm,et al.  A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets , 2015, Molecular & Cellular Proteomics.

[8]  Judea Pearl,et al.  Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach , 1982, AAAI.

[9]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence data bank and its supplement TrEMBL , 1997, Nucleic Acids Res..

[10]  Oliver Serang,et al.  The p-convolution forest: a method for solving graphical models with additive probabilistic equations , 2017, 1708.06448.

[11]  Martin Eisenacher,et al.  In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics. , 2017, Journal of proteomics.

[12]  William Stafford Noble,et al.  A review of statistical methods for protein identification using tandem mass spectrometry. , 2012, Statistics and its interface.

[13]  Knut Reinert,et al.  OpenMS – An open-source software framework for mass spectrometry , 2008, BMC Bioinformatics.

[14]  Lukas Käll,et al.  Recognizing uncertainty increases robustness and reproducibility of mass spectrometry-based protein inferences. , 2012, Journal of proteome research.

[15]  Ian McGraw,et al.  Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing , 2006, UAI.

[16]  Michael I. Jordan,et al.  Loopy Belief Propagation for Approximate Inference: An Empirical Study , 1999, UAI.

[17]  K. Reinert,et al.  OpenMS: a flexible open-source software platform for mass spectrometry data analysis , 2016, Nature Methods.

[18]  Julianus Pfeuffer,et al.  A Bounded p-norm Approximation of Max-Convolution for Sub-Quadratic Bayesian Inference on Additive Factors , 2015, J. Mach. Learn. Res..

[19]  Peter Bühlmann,et al.  Protein and gene model inference based on statistical modeling in k-partite graphs , 2010, Proceedings of the National Academy of Sciences.

[20]  Oliver Serang,et al.  The Probabilistic Convolution Tree: Efficient Exact Bayesian Inference for Faster LC-MS/MS Protein Inference , 2014, PloS one.

[21]  Martin Eisenacher,et al.  PIA: An Intuitive Protein Inference Engine with a Web-Based User Interface. , 2015, Journal of proteome research.

[22]  William Stafford Noble,et al.  Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. , 2010, Journal of proteome research.

[23]  R. Aebersold,et al.  A statistical model for identifying proteins by tandem mass spectrometry. , 2003, Analytical chemistry.

[24]  Martin Eisenacher,et al.  Protein inference using PIA workflows and PSI standard file formats , 2018, bioRxiv.

[25]  Steffen L. Lauritzen,et al.  Bayesian updating in causal probabilistic networks by local computations , 1990 .

[26]  William Stafford Noble,et al.  Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0 , 2016, Journal of The American Society for Mass Spectrometry.

[27]  John S. Cottrell,et al.  Protein identification using MS/MS data. , 2011, Journal of proteomics.

[28]  J. Yates,et al.  Protein analysis by shotgun/bottom-up proteomics. , 2013, Chemical reviews.

[29]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[30]  Martin Eisenacher,et al.  The PRIDE database and related tools and resources in 2019: improving support for quantification data , 2018, Nucleic Acids Res..

[31]  Alexey I Nesvizhskii,et al.  Interpretation of Shotgun Proteomic Data , 2005, Molecular & Cellular Proteomics.

[32]  William Stafford Noble,et al.  Semi-supervised learning for peptide identification from shotgun proteomics datasets , 2007, Nature Methods.

[33]  A. Glavieux,et al.  Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1 , 1993, Proceedings of ICC '93 - IEEE International Conference on Communications.

[34]  Luis Mendoza,et al.  Trans‐Proteomic Pipeline, a standardized data processing pipeline for large‐scale reproducible proteomics informatics , 2015, Proteomics. Clinical applications.

[35]  J. Eng,et al.  Comet: An open‐source MS/MS sequence database search tool , 2013, Proteomics.

[36]  Samuel H Payne,et al.  A protein standard that emulates homology for the characterization of protein inference algorithms , 2017, bioRxiv.