DeepQA: improving the estimation of single protein model quality with deep belief networks

Background: Protein quality assessment (QA), which is used to rank and select protein models, has long been viewed as one of the major challenges in protein tertiary structure prediction. In particular, estimating the quality of a single protein model, which is important for selecting a few good models out of a large pool consisting mostly of low-quality models, is still a largely unsolved problem.

Results: We introduce DeepQA, a novel single-model quality assessment method based on a deep belief network that uses a number of selected features describing the quality of a model from different perspectives, such as energy, physicochemical characteristics, and structural information. The deep belief network is trained on several large datasets consisting of models from the Critical Assessment of Protein Structure Prediction (CASP) experiments, several publicly available datasets, and models generated by our in-house ab initio method. Our experiments demonstrate that the deep belief network performs better than support vector machines and neural networks on the protein model quality assessment problem, and that DeepQA achieves state-of-the-art performance on the CASP11 dataset. It also outperformed two well-established methods in selecting good outlier models from a large set of mostly low-quality models generated by ab initio modeling methods.

Conclusion: DeepQA is a useful deep learning tool for protein single-model quality assessment and protein structure prediction. The source code, executable, documentation, and training/test datasets of DeepQA for Linux are freely available to non-commercial users at http://cactus.rnet.missouri.edu/DeepQA/.
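To make the idea of a deep belief network over per-model features more concrete, the following is a minimal sketch, not the DeepQA implementation. It stacks two restricted Boltzmann machines, pretrains each greedily with one-step contrastive divergence, and fits a logistic output unit that maps the top-layer representation to a quality score in [0, 1]. The feature count, layer sizes, learning rates, epoch counts, and the synthetic data are all assumptions chosen for illustration, and the sketch omits supervised fine-tuning of the whole stack.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: hidden activations driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one reconstruction step.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))  # reconstruction error

def pretrain_stack(X, layer_sizes, epochs, rng):
    """Greedy layer-wise pretraining: each RBM is trained on the layer below."""
    rbms, layer_input = [], X
    for n_hidden in layer_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(layer_input)
        rbms.append(rbm)
        layer_input = rbm.hidden_probs(layer_input)
    return rbms

def encode(rbms, X):
    for rbm in rbms:
        X = rbm.hidden_probs(X)
    return X

def fit_output_layer(H, y, epochs=500, lr=0.5):
    """Logistic output unit fitted by gradient descent on cross-entropy."""
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(epochs):
        grad = sigmoid(H @ w + b) - y
        w -= lr * (H.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-in data (assumption): 200 models, 16 features scaled to [0, 1],
# with a made-up target playing the role of a quality score such as GDT-TS.
rng = np.random.default_rng(0)
X = rng.random((200, 16))
y = X[:, :4].mean(axis=1)

rbms = pretrain_stack(X, layer_sizes=[12, 8], epochs=50, rng=rng)
w, b = fit_output_layer(encode(rbms, X), y)
pred = sigmoid(encode(rbms, X) @ w + b)  # one predicted score per model
```

In practice the inputs would be the energy, physicochemical, and structural features computed for each model, scaled to a common range, and the predicted scores would be used to rank the models in a pool.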
