Challenges in proteogenomics: a comparison of analysis methods with the case study of the DREAM proteogenomics sub-challenge

Proteomic measurements, which closely reflect phenotypes, provide insights into the regulation of gene expression and the mechanisms underlying altered phenotypes. Further, integrating proteome-level and transcriptome-level data can validate gene signatures associated with a phenotype. However, proteomic data are less abundant than genomic data, so it is beneficial to use genomic features to predict protein abundances when matching proteomic samples, or measurements within samples, are lacking. We evaluate and compare four data-driven models for predicting proteomic data from mRNA measured in breast and ovarian cancers, using the 2017 DREAM Proteogenomics Challenge data. Our results show that Bayesian network, random forest, LASSO, and fuzzy logic approaches can predict protein abundance levels with median correlations between ground-truth and predicted values of 0.2 to 0.5. However, the most accurately predicted proteins differ considerably between approaches. In addition to benchmarking the aforementioned machine learning approaches for predicting protein levels from transcript levels, we discuss challenges and potential solutions in state-of-the-art proteogenomic analyses.
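The evaluation metric mentioned above, a median of per-protein correlations between ground-truth and predicted abundances, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Pearson correlation (the abstract does not name the coefficient), and the function names `pearson` and `median_protein_correlation`, the protein labels, and the toy values are all hypothetical.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def median_protein_correlation(observed, predicted):
    """Correlate each protein's observed and predicted values across samples,
    then summarise the per-protein scores by their median."""
    scores = {}
    for protein, truth in observed.items():
        r = pearson(truth, predicted[protein])
        if r == r:  # drop NaN scores (NaN is not equal to itself)
            scores[protein] = r
    return median(scores.values()), scores


# Hypothetical toy data: three proteins measured in four samples each.
observed = {"P1": [1, 2, 3, 4], "P2": [1, 2, 3, 4], "P3": [2, 4, 1, 3]}
predicted = {"P1": [2, 4, 6, 8], "P2": [4, 3, 2, 1], "P3": [1, 3, 2, 4]}

med, scores = median_protein_correlation(observed, predicted)
print(med, scores)
```

Scoring each protein separately and then taking the median keeps a handful of very well or very poorly predicted proteins from dominating the summary, which matters when the best-predicted proteins differ between approaches.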
