Predicting residue-residue contacts using random forest models

MOTIVATION Protein residue-residue contact prediction can be useful in predicting protein 3D structures. Current algorithms for such a purpose leave room for improvement. RESULTS We develop ProC_S3, a set of Random Forest algorithm-based models, for predicting residue-residue contact maps. The models are constructed based on a collection of 1490 non-redundant, high-resolution protein structures using >1280 sequence-based features. A new amino acid residue contact propensity matrix and a new set of seven amino acid groups based on contact preference are developed and used in ProC_S3. ProC_S3 delivers a 3-fold cross-validated accuracy of 26.9% with coverage of 4.7% for top L/5 predictions (L is the number of residues in a protein) of long-range contacts (sequence separation ≥24). Further benchmark tests deliver an accuracy of 29.7% and coverage of 5.6% for an independent set of 329 proteins. In the recently completed Ninth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP9), ProC_S3 is ranked as No. 1, No. 3, and No. 2 accuracies in the top L/5, L/10 and best 5 predictions of long-range contacts, respectively, among 18 automatic prediction servers. AVAILABILITY http://www.abl.ku.edu/proc/proc_s3.html. CONTACT jwfang@ku.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  Jianwen Fang,et al.  Feature Selection in Validating Mass Spectrometry Database Search Results , 2008, J. Bioinform. Comput. Biol..

[2]  J. Skolnick,et al.  TOUCHSTONE II: a new approach to ab initio protein structure prediction. , 2003, Biophysical journal.

[3]  A. Szilágyi,et al.  Improving protein structure prediction using multiple sequence-based contact predictions. , 2011, Structure.

[4]  D. Baker,et al.  Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design , 2005, Nucleic acids research.

[5]  De-Shuang Huang,et al.  Prediction of inter-residue contacts map based on genetic algorithm optimized radial basis function neural network and binary input encoding scheme , 2004, J. Comput. Aided Mol. Des..

[6]  W. Braun,et al.  Sequence specificity, statistical potentials, and three‐dimensional structure prediction with self‐correcting distance geometry calculations of β‐sheet formation in proteins , 2008 .

[7]  Yang Zhang,et al.  REMO: A new protocol to refine full atomic protein models from C‐alpha traces by optimizing hydrogen‐bonding networks , 2009, Proteins.

[8]  Burkhard Rost,et al.  PROFcon: novel prediction of long-range contacts , 2005, Bioinform..

[9]  M. Levitt,et al.  A lattice model for protein structure prediction at low resolution. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[10]  R. Jernigan,et al.  Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. , 1996, Journal of molecular biology.

[11]  Liam J. McGuffin,et al.  The PSIPRED protein structure prediction server , 2000, Bioinform..

[12]  P Fariselli,et al.  Progress in predicting inter‐residue contacts of proteins with neural networks and correlated mutations , 2001, Proteins.

[13]  Yang Zhang,et al.  A comprehensive assessment of sequence-based and template-based methods for protein contact prediction , 2008, Bioinform..

[14]  Bin Xue,et al.  Predicting residue–residue contact maps by a two‐layer, integrated neural‐network method , 2009, Proteins.

[15]  Pierre Baldi,et al.  SCRATCH: a protein structure and structural feature prediction server , 2005, Nucleic Acids Res..

[16]  Frank Eisenhaber,et al.  Improved strategy in analytic surface calculation for molecular systems: Handling of singularities and computational efficiency , 1993, J. Comput. Chem..

[17]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[18]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[19]  Alessandro Vullo,et al.  A two-stage approach for improved prediction of residue contact maps , 2006, BMC Bioinformatics.

[20]  Yang Zhang,et al.  I‐TASSER: Fully automated protein structure prediction in CASP8 , 2009, Proteins.

[21]  Pierre Baldi,et al.  SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs , 2008, BMC Structural Biology.

[22]  Peng Chen,et al.  Prediction of protein long-range contacts using an ensemble of genetic algorithm classifiers with sequence profile centers , 2010, BMC Structural Biology.

[23]  George Karypis,et al.  Prediction of contact maps using support vector machines , 2003, Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings..

[24]  Piero Fariselli,et al.  Improved prediction of the number of residue contacts in proteins by recurrent neural networks , 2001, ISMB.

[25]  Robert M. MacCallum,et al.  Striped sheets and protein contact prediction , 2004, ISMB/ECCB.

[26]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[27]  Jianhan Chen,et al.  Can molecular dynamics simulations provide high‐resolution refinement of protein structure? , 2007, Proteins.

[28]  Pierre Baldi,et al.  Improved residue contact prediction using support vector machines and a large feature set , 2007, BMC Bioinformatics.

[29]  Osvaldo Graña,et al.  Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8 , 2009, Proteins.

[30]  Christopher Bystroff,et al.  Predicting interresidue contacts using templates and pathways , 2003, Proteins.

[31]  Torgeir R. Hvidsten,et al.  Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts , 2009, Bioinform..

[32]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[33]  A. Tramontano,et al.  Evaluation of residue–residue contact predictions in CASP9 , 2011, Proteins.

[34]  Liangjiang Wang,et al.  Prediction of DNA-binding residues from protein sequence information using random forests , 2009, BMC Genomics.

[35]  Burkhard Rost,et al.  EVAcon: a protein contact prediction evaluation service , 2005, Nucleic Acids Res..

[36]  R. Casadio,et al.  A neural network based predictor of residue contacts in proteins. , 1999, Protein engineering.

[37]  Jianlin Cheng,et al.  NNcon: improved protein contact map prediction using 2D-recursive neural networks , 2009, Nucleic Acids Res..

[38]  Kristian Vlahovicek,et al.  Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests , 2009, PLoS Comput. Biol..