EPSILON-CP: using deep learning to combine information from multiple sources for protein contact prediction

Background

Accurately predicted contacts make it possible to compute the 3D structure of a protein. Since the solution space of native residue-residue contact pairs is very large, it is necessary to leverage information to identify relevant regions of the solution space, i.e. correct contacts. Every additional source of information can contribute to narrowing down candidate regions. Therefore, recent methods combined evolutionary and sequence-based information as well as evolutionary and physicochemical information. We develop a new contact predictor (EPSILON-CP) that goes beyond current methods by combining evolutionary, physicochemical, and sequence-based information. The problems resulting from the increased dimensionality and complexity of the learning problem are addressed with a careful feature analysis, which results in a drastically reduced feature set. The different information sources are combined using deep neural networks.

Results

On 21 hard CASP11 FM targets, EPSILON-CP achieves a mean precision of 35.7% for top-L/10 predicted long-range contacts, which is 11% better than the CASP11 winning version of MetaPSICOV. The improvement on 1.5L is 17%. Furthermore, in this study we find that the amino acid composition, a commonly used feature, is rendered ineffective in the context of meta approaches. The size of the refined feature set decreased by 75%, enabling a substantial increase in training data for machine learning, which contributed significantly to the observed improvements.

Conclusions

Exploiting as much and as diverse information as possible is key to accurate contact prediction. Simply merging the information introduces new challenges. Our study suggests that critical feature analysis can improve the performance of contact prediction methods that combine multiple information sources. EPSILON-CP is available as a web service: http://compbio.robotics.tu-berlin.de/epsilon/
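The top-L/10 and 1.5L figures reported above are ranked-precision scores, where L is the sequence length. The following is a minimal sketch of how such a score could be computed for one target, not the authors' evaluation code. The function name, the dictionary and set inputs, and the long-range cutoff of a sequence separation of at least 24 residues (a common CASP convention) are assumptions made for illustration; the abstract does not state them.

```python
def top_l_precision(scores, contacts, length, fraction=0.1, min_sep=24):
    """Precision of the top (length * fraction) predicted long-range contacts.

    scores:   dict mapping residue pairs (i, j) with i < j to predicted scores
    contacts: set of native contact pairs (i, j) with i < j
    length:   sequence length L
    fraction: 0.1 for top-L/10, 1.5 for top-1.5L
    min_sep:  minimum sequence separation for "long-range" (assumed: 24)
    """
    # Keep only long-range pairs.
    long_range = [(pair, s) for pair, s in scores.items()
                  if pair[1] - pair[0] >= min_sep]
    # Rank by predicted score, highest first.
    long_range.sort(key=lambda item: item[1], reverse=True)
    # Take the top L * fraction predictions (at least one).
    n = max(1, int(length * fraction))
    top = long_range[:n]
    if not top:
        return 0.0
    # Fraction of the selected predictions that are native contacts.
    hits = sum(1 for pair, _ in top if pair in contacts)
    return hits / len(top)


# Toy example with made-up scores for a hypothetical 40-residue protein.
example_scores = {
    (0, 30): 0.9, (1, 35): 0.8, (2, 39): 0.7,
    (5, 10): 0.99,  # short-range pair, ignored by the long-range filter
    (3, 28): 0.6, (0, 25): 0.5,
}
example_contacts = {(0, 30), (2, 39), (5, 10), (0, 25)}
example_precision = top_l_precision(example_scores, example_contacts, length=40)
```

A mean precision such as the one quoted for the 21 targets would then be the average of this per-target value over all targets.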
