NeBcon: protein contact map prediction using neural network training coupled with naïve Bayes classifiers

Motivation: Recent CASP experiments have witnessed exciting progress on folding large‐size non‐humongous proteins with the assistance of co‐evolution based contact predictions. The success is however anecdotal due to the requirement of the contact prediction methods for the high volume of sequence homologs that are not available to most of the non‐humongous protein targets. Development of efficient methods that can generate balanced and reliable contact maps for different type of protein targets is essential to enhance the success rate of the ab initio protein structure prediction. Results: We developed a new pipeline, NeBcon, which uses the naïve Bayes classifier (NBC) theorem to combine eight state of the art contact methods that are built from co‐evolution and machine learning approaches. The posterior probabilities of the NBC model are then trained with intrinsic structural features through neural network learning for the final contact map prediction. NeBcon was tested on 98 non‐redundant proteins, which improves the accuracy of the best co‐evolution based meta‐server predictor by 22%; the magnitude of the improvement increases to 45% for the hard targets that lack sequence and structural homologs in the databases. Detailed data analysis showed that the major contribution to the improvement is due to the optimized NBC combination of the complementary information from both co‐evolution and machine learning predictions. The neural network training also helps to improve the coupling of the NBC posterior probability and the intrinsic structural features, which were found particularly important for the proteins that do not have sufficient number of homologous sequences to derive reliable co‐evolution profiles. Availiablity and Implementation: On‐line server and standalone package of the program are available at http://zhanglab.ccmb.med.umich.edu/NeBcon/. Contact: zhng@umich.edu Supplementary information: Supplementary data are available at Bioinformatics online.

[1]  David E. Kim,et al.  Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta , 2016, Proteins.

[2]  Pierre Baldi,et al.  Improved residue contact prediction using support vector machines and a large feature set , 2007, BMC Bioinformatics.

[3]  Yang Zhang,et al.  A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction , 2013, Scientific Reports.

[4]  Yang Zhang,et al.  A comprehensive assessment of sequence-based and template-based methods for protein contact prediction , 2008, Bioinform..

[5]  Lisa N Kinch,et al.  Evaluation of free modeling targets in CASP11 and ROLL , 2016, Proteins.

[6]  Kevin Karplus,et al.  Contact prediction using mutual information and neural nets , 2007, Proteins.

[7]  C. Sander,et al.  Direct-coupling analysis of residue coevolution captures native contacts across many protein families , 2011, Proceedings of the National Academy of Sciences.

[8]  Jinbo Xu,et al.  Improving consensus contact prediction via server correlation reduction , 2009, BMC Structural Biology.

[9]  Pierre Baldi,et al.  Three-stage prediction of protein ?-sheets by neural networks, alignments and graph algorithms , 2005, ISMB.

[10]  Malgorzata Kotulska,et al.  Automated Procedure for Contact-Map-Based Protein Structure Reconstruction , 2014, The Journal of Membrane Biology.

[11]  Markus Gruber,et al.  CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations , 2014, Bioinform..

[12]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[13]  Alfonso Valencia,et al.  Assessment of intramolecular contact predictions for CASP7 , 2007, Proteins.

[14]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[15]  Sitao Wu,et al.  LOMETS: A local meta-threading-server for protein structure prediction , 2007, Nucleic acids research.

[16]  A. Szilágyi,et al.  Improving protein structure prediction using multiple sequence-based contact predictions. , 2011, Structure.

[17]  Marcin J. Skwark,et al.  Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns , 2014, PLoS Comput. Biol..

[18]  M Vendruscolo,et al.  Recovery of protein structure from contact maps. , 1997, Folding & design.

[19]  Massimiliano Pontil,et al.  PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments , 2012, Bioinform..

[20]  David T. Jones,et al.  MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins , 2014, Bioinform..

[21]  J. Skolnick,et al.  MONSSTER: a method for folding globular proteins with a small number of distance restraints. , 1997, Journal of molecular biology.

[22]  C. Sander,et al.  Correlated Mutations and Residue Contacts , 1994 .

[23]  T. Hwa,et al.  Identification of direct residue contacts in protein–protein interaction by message passing , 2009, Proceedings of the National Academy of Sciences.

[24]  Erik van Nimwegen,et al.  Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments , 2010, PLoS Comput. Biol..

[25]  D. Baker,et al.  Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era , 2013, Proceedings of the National Academy of Sciences.

[26]  Yang Zhang,et al.  Ab initio protein structure assembly using continuous structure fragments and optimized knowledge‐based force field , 2012, Proteins.

[27]  C. Sander,et al.  Correlated mutations and residue contacts in proteins , 1994, Proteins.

[28]  Sitao Wu,et al.  MUSTER: Improving protein sequence profile–profile alignments by using multiple sources of structure information , 2008, Proteins.

[29]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[30]  C. Sander,et al.  Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? , 1994, Protein engineering.

[31]  Thomas A. Hopf,et al.  Protein 3D Structure Computed from Evolutionary Sequence Variation , 2011, PloS one.

[32]  A. Tramontano,et al.  New encouraging developments in contact prediction: Assessment of the CASP11 results , 2016, Proteins.

[33]  J. Skolnick,et al.  TOUCHSTONE II: a new approach to ab initio protein structure prediction. , 2003, Biophysical journal.

[34]  Burkhard Rost,et al.  FreeContact: fast and free software for protein contact prediction from residue co-evolution , 2014, BMC Bioinformatics.

[35]  Osvaldo Graña,et al.  Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8 , 2009, Proteins.

[36]  Anna Tramontano,et al.  Evaluation of residue–residue contact prediction in CASP10 , 2014, Proteins.

[37]  A. Tramontano,et al.  Evaluation of residue–residue contact predictions in CASP9 , 2011, Proteins.

[38]  E. Aurell,et al.  Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. , 2012, Physical review. E, Statistical, nonlinear, and soft matter physics.