Random Forest-Based Protein Model Quality Assessment (RFMQA) Using Structural Features and Potential Energy Terms

Recently, predicting proteins three-dimensional (3D) structure from its sequence information has made a significant progress due to the advances in computational techniques and the growth of experimental structures. However, selecting good models from a structural model pool is an important and challenging task in protein structure prediction. In this study, we present the first application of random forest based model quality assessment (RFMQA) to rank protein models using its structural features and knowledge-based potential energy terms. The method predicts a relative score of a model by using its secondary structure, solvent accessibility and knowledge-based potential energy terms. We trained and tested the RFMQA method on CASP8 and CASP9 targets using 5-fold cross-validation. The correlation coefficient between the TM-score of the model selected by RFMQA (TMRF) and the best server model (TMbest) is 0.945. We benchmarked our method on recent CASP10 targets by using CASP8 and 9 server models as a training set. The correlation coefficient and average difference between TMRF and TMbest over 95 CASP10 targets are 0.984 and 0.0385, respectively. The test results show that our method works better in selecting top models when compared with other top performing methods. RFMQA is available for download from http://lee.kias.re.kr/RFMQA/RFMQA_eval.tar.gz.

[1]  Liam J. McGuffin,et al.  The PSIPRED protein structure prediction server , 2000, Bioinform..

[2]  B. Honig,et al.  Free energy determinants of tertiary structure and the evaluation of protein models , 2000, Protein science : a publication of the Protein Society.

[3]  M. Karplus,et al.  Discrimination of the native from misfolded protein models with an energy function including implicit solvation. , 1999, Journal of molecular biology.

[4]  Yang Zhang,et al.  I-TASSER: a unified platform for automated protein structure and function prediction , 2010, Nature Protocols.

[5]  J. Skolnick,et al.  GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. , 2011, Biophysical journal.

[6]  Yang Zhang,et al.  A Novel Side-Chain Orientation Dependent Potential Derived from Random-Walk Reference State for Protein Fold Selection and Structure Prediction , 2010, PloS one.

[7]  Richard Bonneau,et al.  Ab initio protein structure prediction of CASP III targets using ROSETTA , 1999, Proteins.

[8]  Daisuke Kihara,et al.  Quality assessment of protein structure models. , 2009, Current protein & peptide science.

[9]  International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome , 2001, Nature.

[10]  Jianwen Fang,et al.  Bioinformatic analysis of xenobiotic reactive metabolite target proteins and their interacting partners , 2009, BMC chemical biology.

[11]  Liam J. McGuffin,et al.  The ModFOLD server for the quality assessment of protein structural models , 2008, Bioinform..

[12]  Jianlin Cheng,et al.  Evaluating the absolute quality of a single protein model using structural features and support vector machines , 2009, Proteins.

[13]  Jooyoung Lee,et al.  Hidden Information Revealed by Optimal Community Structure from a Protein-Complex Bipartite Network Improves Protein Function Prediction , 2013, PloS one.

[14]  Arne Elofsson,et al.  Assessment of global and local model quality in CASP8 using Pcons and ProQ , 2009, Proteins.

[15]  Arne Elofsson,et al.  Prediction of global and local model quality in CASP7 using Pcons and ProQ , 2007, Proteins.

[16]  Björn Wallner,et al.  Improved model quality assessment using ProQ2 , 2012, BMC Bioinformatics.

[17]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[18]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[19]  Dong Xu,et al.  A sampling-based method for ranking protein structural models by integrating multiple scores and features. , 2011, Current protein & peptide science.

[20]  Jianlin Cheng,et al.  MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8 , 2010, Bioinform..

[21]  Silvio C. E. Tosatto,et al.  Global and local model quality estimation at CASP8 using the scoring functions QMEAN and QMEANclust , 2009, Proteins.

[22]  Yang Zhang,et al.  Scoring function for automated assessment of protein structure template quality , 2004, Proteins.

[23]  Liam J. McGuffin Prediction of global and local model quality in CASP8 using the ModFOLD server , 2009, Proteins.

[24]  Hongyi Zhou,et al.  Distance‐scaled, finite ideal‐gas reference state improves structure‐derived potentials of mean force for structure selection and stability prediction , 2002, Protein science : a publication of the Protein Society.

[25]  Anna Tramontano,et al.  Assessment of predictions in the model quality assessment category , 2007, Proteins.

[26]  Liangjiang Wang,et al.  Prediction of DNA-binding residues from protein sequence information using random forests , 2009, BMC Genomics.

[27]  Liam J. McGuffin,et al.  Rapid model quality assessment for protein structure predictions using the comparison of multiple models without structural alignments , 2010, Bioinform..

[28]  Anna Tramontano,et al.  Evaluation of model quality predictions in CASP9 , 2011, Proteins.

[29]  Jianwen Fang,et al.  PROTS-RF: A Robust Model for Predicting Mutation-Induced Protein Stability Changes , 2012, PloS one.

[30]  Ceslovas Venclovas,et al.  Progress over the first decade of CASP experiments , 2005, Proteins.

[31]  Jianpeng Ma,et al.  OPUS-PSP: an orientation-dependent statistical all-atom potential derived from side-chain packing. , 2008, Journal of molecular biology.

[32]  Keehyoung Joo,et al.  Protein structure modeling for CASP10 by multiple layers of global optimization , 2014, Proteins.

[33]  Yaoqi Zhou,et al.  Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely related all‐atom statistical energy functions , 2008, Protein science : a publication of the Protein Society.

[34]  Arne Elofsson,et al.  Automatic consensus‐based fold recognition using Pcons, ProQ, and Pmodeller , 2003, Proteins.

[35]  Kristian Vlahovicek,et al.  Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests , 2009, PLoS Comput. Biol..

[36]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[37]  Pierre Baldi,et al.  Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data , 2005, Data Mining and Knowledge Discovery.

[38]  Anna Tramontano,et al.  Assessment of the assessment: Evaluation of the model quality estimates in CASP10 , 2014, Proteins.

[39]  Arne Elofsson,et al.  3D-Jury: A Simple Approach to Improve Protein Structure Predictions , 2003, Bioinform..

[40]  Keehyoung Joo,et al.  proteins STRUCTURE O FUNCTION O BIOINFORMATICS SANN: Solvent accessibility prediction of proteins , 2022 .

[41]  Yaoqi Zhou,et al.  Specific interactions for ab initio folding of protein terminal regions with secondary structures , 2008, Proteins.

[42]  John Moult,et al.  A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. , 2005, Current opinion in structural biology.

[43]  David Baker,et al.  Ranking predicted protein structures with support vector regression , 2007, Proteins.

[44]  Yang Zhang,et al.  I‐TASSER: Fully automated protein structure prediction in CASP8 , 2009, Proteins.

[45]  Jianlin Cheng,et al.  Prediction of global and local quality of CASP8 models by MULTICOM series , 2009, Proteins.

[46]  Jooyoung Lee,et al.  Improved network community structure improves function prediction , 2013, Scientific Reports.

[47]  Jianwen Fang,et al.  Feature Selection in Validating Mass Spectrometry Database Search Results , 2008, J. Bioinform. Comput. Biol..

[48]  A. Sali,et al.  Protein Structure Prediction and Structural Genomics , 2001, Science.