Machine Learning Approaches for Quality Assessment of Protein Structures

Protein structures play a very important role in biomedical research, especially in drug discovery and design, which require accurate protein structures in advance. However, experimental determinations of protein structure are prohibitively costly and time-consuming, and computational predictions of protein structures have not been perfected. Methods that assess the quality of protein models can help in selecting the most accurate candidates for further work. Driven by this demand, many structural bioinformatics laboratories have developed methods for estimating model accuracy (EMA). In recent years, EMA by machine learning (ML) have consistently ranked among the top-performing methods in the community-wide CASP challenge. Accordingly, we systematically review all the major ML-based EMA methods developed within the past ten years. The methods are grouped by their employed ML approach—support vector machine, artificial neural networks, ensemble learning, or Bayesian learning—and their significances are discussed from a methodology viewpoint. To orient the reader, we also briefly describe the background of EMA, including the CASP challenge and its evaluation metrics, and introduce the major ML/DL techniques. Overall, this review provides an introductory guide to modern research on protein quality assessment and directions for future research in this area.

[1]  David T Jones,et al.  Recent developments in deep learning applied to protein structure prediction , 2019, Proteins.

[2]  Jianzhu Ma,et al.  Protein structure alignment beyond spatial proximity , 2013, Scientific Reports.

[3]  Ben M. Webb,et al.  Comparative Protein Structure Modeling Using MODELLER , 2016, Current protocols in bioinformatics.

[4]  Xiangrong Liu,et al.  Machine Learning for Drug-Target Interaction Prediction , 2018, Molecules.

[5]  Liam J. McGuffin,et al.  ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models , 2017, Nucleic Acids Res..

[6]  Minkyung Baek,et al.  Assessment of protein model structure accuracy estimation in CASP13: Challenges in the era of deep learning , 2019, Proteins.

[7]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[8]  Anna Tramontano,et al.  Evaluation of CASP8 model quality predictions , 2009, Proteins.

[9]  Andreas Persidis,et al.  Artificial intelligence for drug design , 1997, Nature Biotechnology.

[10]  Laurent Emmanuel Dardenne,et al.  Estimating Protein Structure Prediction Models Quality Using Convolutional Neural Networks , 2018, 2018 International Joint Conference on Neural Networks (IJCNN).

[11]  Kei Yura,et al.  [Structural bioinformatics]. , 2009, Tanpakushitsu kakusan koso. Protein, nucleic acid, enzyme.

[12]  Arne Elofsson,et al.  Methods for estimation of model accuracy in CASP12 , 2017, bioRxiv.

[13]  Yang Zhang,et al.  3DRobot: automated generation of diverse and well-packed protein structure decoys , 2016, Bioinform..

[14]  Pierre Baldi,et al.  SCRATCH: a protein structure and structural feature prediction server , 2005, Nucleic Acids Res..

[15]  Krzysztof Fidelis,et al.  Processing and evaluation of predictions in CASP4 , 2001, Proteins.

[16]  Richard Bonneau,et al.  Ab initio protein structure prediction of CASP III targets using ROSETTA , 1999, Proteins.

[17]  Miao Sun,et al.  QAcon: single model quality assessment using protein structural and contact information with machine learning techniques , 2016, Bioinform..

[18]  Jan Hauke,et al.  Comparison of Values of Pearson's and Spearman's Correlation Coefficients on the Same Sets of Data , 2011 .

[19]  Gerhard Nahler,et al.  Pearson Correlation Coefficient , 2020, Definitions.

[20]  Jie Hou,et al.  DNCON2: improved protein contact prediction using two-level deep convolutional neural networks , 2017, bioRxiv.

[21]  R. Schulz,et al.  Protein Structure Prediction , 2020, Methods in Molecular Biology.

[22]  Yoshua Bengio,et al.  Deep convolutional networks for quality assessment of protein folds , 2018, Bioinform..

[23]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[24]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25]  Jie Hou,et al.  DeepQA: improving the estimation of single protein model quality with deep belief networks , 2016, BMC Bioinformatics.

[26]  Yang Zhang,et al.  I-TASSER server for protein 3D structure prediction , 2008, BMC Bioinformatics.

[27]  Jacek Blazewicz,et al.  SphereGrinder - reference structure-based tool for quality assessment of protein structural models , 2015, 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[28]  Ben M. Webb,et al.  Comparative Protein Structure Modeling Using Modeller , 2006, Current protocols in bioinformatics.

[29]  Pushmeet Kohli,et al.  Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13) , 2019, Proteins.

[30]  David Baker,et al.  Protein Structure Prediction Using Rosetta , 2004, Numerical Computer Methods, Part D.

[31]  Adam Zemla,et al.  LGA: a method for finding 3D similarities in protein structures , 2003, Nucleic Acids Res..

[32]  Arne Elofsson,et al.  Identification of correct regions in protein models using structural, alignment, and consensus information , 2006, Protein science : a publication of the Protein Society.

[33]  Zachary Chase Lipton A Critical Review of Recurrent Neural Networks for Sequence Learning , 2015, ArXiv.

[34]  Marco Biasini,et al.  lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests , 2013, Bioinform..

[35]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[36]  Dong Xu,et al.  FALCON@home: a high-throughput protein structure prediction server based on remote homologue recognition , 2016, Bioinform..

[37]  Miao Sun,et al.  AngularQA: Protein Model Quality Assessment with LSTM Networks , 2019 .

[38]  Arne Elofsson,et al.  ProQ3D: improved model quality assessments using deep learning , 2016, Bioinform..

[39]  Torsten Schwede,et al.  Assessment of model accuracy estimations in CASP12 , 2018, Proteins.

[40]  Takashi Ishida,et al.  Protein model accuracy estimation based on local structure quality assessment using 3D convolutional neural network , 2019, PloS one.

[41]  Juergen Haas,et al.  The Protein Model Portal—a comprehensive resource for protein structure and model information , 2013, Database J. Biol. Databases Curation.

[42]  M. Mukaka,et al.  Statistics corner: A guide to appropriate use of correlation coefficient in medical research. , 2012, Malawi medical journal : the journal of Medical Association of Malawi.

[43]  Tamar Schlick,et al.  Molecular Modeling and Simulation: An Interdisciplinary Guide , 2010 .

[44]  Nazri Mohd Nawi,et al.  An Improved Learning Algorithm Based on The Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method For Back Propagation Neural Networks , 2006, Sixth International Conference on Intelligent Systems Design and Applications.

[45]  Thomas A. Hopf,et al.  Protein 3D Structure Computed from Evolutionary Sequence Variation , 2011, PloS one.

[46]  이상헌,et al.  Deep Belief Networks , 2010, Encyclopedia of Machine Learning.

[47]  Torsten Schwede,et al.  QMEANDisCo—distance constraints applied on model quality estimation , 2019, Bioinform..

[48]  Jooyoung Lee,et al.  SVMQA: support‐vector‐machine‐based protein single‐model quality assessment , 2017, Bioinform..

[49]  Yang Zhang,et al.  A Novel Side-Chain Orientation Dependent Potential Derived from Random-Walk Reference State for Protein Fold Selection and Structure Prediction , 2010, PloS one.

[50]  Z. Luthey-Schulten,et al.  Ab initio protein structure prediction. , 2002, Current opinion in structural biology.

[51]  H. Abdi The Kendall Rank Correlation Coefficient , 2007 .

[52]  Anna Tramontano,et al.  Assessment of predictions in the model quality assessment category , 2007, Proteins.

[53]  Nicolas Le Roux,et al.  Representational Power of Restricted Boltzmann Machines and Deep Belief Networks , 2008, Neural Computation.

[54]  Anna Tramontano,et al.  Assessment of the assessment: Evaluation of the model quality estimates in CASP10 , 2014, Proteins.

[55]  Torsten Schwede,et al.  Critical assessment of methods of protein structure prediction (CASP)—Round XIII , 2019, Proteins.

[56]  Paulo S. C. Alencar,et al.  The use of machine learning algorithms in recommender systems: A systematic review , 2015, Expert Syst. Appl..

[57]  Dong Xu,et al.  DL-PRO: A novel deep learning method for protein model quality assessment , 2014, 2014 International Joint Conference on Neural Networks (IJCNN).

[58]  Jilong Li,et al.  Large-scale model quality assessment for improving protein tertiary structure prediction , 2015, Bioinform..

[59]  Torsten Schwede,et al.  SWISS-MODEL: homology modelling of protein structures and complexes , 2019 .

[60]  Renzhi Cao,et al.  Protein single-model quality assessment by feature-based probability density functions , 2016, Scientific Reports.

[61]  Anna Tramontano,et al.  Critical assessment of methods of protein structure prediction (CASP) — round x , 2014, Proteins.

[62]  Arne Elofsson,et al.  ProQ3: Improved model quality assessments using Rosetta energy terms , 2016, Scientific Reports.

[63]  Sotiris B. Kotsiantis,et al.  Supervised Machine Learning: A Review of Classification Techniques , 2007, Informatica.

[64]  Liam J McGuffin,et al.  IntFOLD: an integrated web resource for high performance protein structure and function prediction , 2019, Nucleic Acids Res..

[65]  Kliment Olechnovič,et al.  CAD‐score: A new contact area difference‐based function for evaluation of protein structural models , 2013, Proteins.

[66]  Jianlin Cheng,et al.  Evaluating the absolute quality of a single protein model using structural features and support vector machines , 2009, Proteins.

[67]  Liam J. McGuffin,et al.  The ModFOLD server for the quality assessment of protein structural models , 2008, Bioinform..

[68]  Guoli Wang,et al.  PISCES: a protein sequence culling server , 2003, Bioinform..

[69]  Zheng Wang,et al.  Benchmarking Deep Networks for Predicting Residue-Specific Quality of Individual Protein Models in CASP11 , 2016, Scientific Reports.

[70]  S. Jones,et al.  Protein-RNA interactions: a structural analysis. , 2001, Nucleic acids research.

[71]  David T. Jones,et al.  DISOPRED3: precise disordered region predictions with annotated protein-binding activity , 2014, Bioinform..

[72]  Yaoqi Zhou,et al.  Specific interactions for ab initio folding of protein terminal regions with secondary structures , 2008, Proteins.

[73]  Arne Elofsson,et al.  Estimation of model accuracy in CASP13 , 2019, Proteins.

[74]  Kliment Olechnovič,et al.  VoroMQA: Assessment of protein structure quality using interatomic contact areas , 2017, Proteins.

[75]  Renzhi Cao,et al.  Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13 , 2019, bioRxiv.

[76]  Anna R. Panchenko,et al.  Computational Approaches to Prioritize Cancer Driver Missense Mutations , 2018, International journal of molecular sciences.

[77]  Arne Elofsson,et al.  Deep transfer learning in the assessment of the quality of protein models , 2018, 1804.06281.

[78]  Mohammed AlQuraishi,et al.  ProteinNet: a standardized data set for machine learning of protein structure , 2019, BMC Bioinformatics.

[79]  Yang Zhang,et al.  Scoring function for automated assessment of protein structure template quality , 2004, Proteins.

[80]  Brian Kuhlman,et al.  Advances in protein structure prediction and design , 2019, Nature Reviews Molecular Cell Biology.

[81]  Andrej Sali,et al.  Comparative Protein Structure Modeling and its Applications to Drug Discovery , 2004 .

[82]  Ying Xu,et al.  Raptor: Optimal Protein Threading by Linear Programming , 2003, J. Bioinform. Comput. Biol..

[83]  Guanyu Wang,et al.  Machine Learning Based Toxicity Prediction: From Chemical Structural Description to Transcriptome Analysis , 2018, International journal of molecular sciences.

[84]  Charlotte Deane,et al.  Co-evolution techniques are reshaping the way we do structural bioinformatics , 2017, F1000Research.

[85]  Yan Wang,et al.  ResQ: An Approach to Unified Estimation of B-Factor and Residue-Specific Error in Protein Structure Prediction. , 2016, Journal of molecular biology.

[86]  Chen Keasar,et al.  Purely Structural Protein Scoring Functions Using Support Vector Machine and Ensemble Learning , 2019, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[87]  J. Skolnick,et al.  Ab initio modeling of small proteins by iterative TASSER simulations , 2007, BMC Biology.

[88]  Anna Tramontano,et al.  Evaluation of model quality predictions in CASP9 , 2011, Proteins.

[89]  Anna Tramontano,et al.  Methods of model accuracy estimation can help selecting the best models from decoy sets: Assessment of model accuracy estimations in CASP11 , 2016, Proteins.

[90]  Balachandran Manavalan,et al.  Random Forest-Based Protein Model Quality Assessment (RFMQA) Using Structural Features and Potential Energy Terms , 2014, PloS one.

[91]  Liam J. McGuffin,et al.  Rapid model quality assessment for protein structure predictions using the comparison of multiple models without structural alignments , 2010, Bioinform..

[92]  Liam J. McGuffin,et al.  The ModFOLD4 server for the quality assessment of 3D protein models , 2013, Nucleic Acids Res..

[93]  Demis Hassabis,et al.  Improved protein structure prediction using potentials from deep learning , 2020, Nature.

[94]  Björn Wallner,et al.  Improved model quality assessment using ProQ2 , 2012, BMC Bioinformatics.

[95]  Thomas A. Hopf,et al.  Protein structure prediction from sequence variation , 2012, Nature Biotechnology.

[96]  Xiangang Li,et al.  Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition , 2014, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[97]  J. Skolnick,et al.  GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. , 2011, Biophysical journal.

[98]  Cha Zhang,et al.  Ensemble Machine Learning: Methods and Applications , 2012 .

[99]  Dong Xu,et al.  Two New Heuristic Methods for Protein Model Quality Assessment , 2020, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[100]  Björn Wallner,et al.  rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments , 2019, PloS one.

[101]  S. Bryant,et al.  Critical assessment of methods of protein structure prediction (CASP): Round II , 1997, Proteins.

[102]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[103]  Sergei Grudinin,et al.  Protein model quality assessment using 3D oriented convolutional neural networks , 2018, bioRxiv.

[104]  A. Valencia,et al.  Emerging methods in protein co-evolution , 2013, Nature Reviews Genetics.

[105]  C. Sander,et al.  Direct-coupling analysis of residue coevolution captures native contacts across many protein families , 2011, Proceedings of the National Academy of Sciences.

[106]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[107]  Gisbert Schneider,et al.  Deep Learning in Drug Discovery , 2016, Molecular informatics.

[108]  Jian Peng,et al.  Template-based protein structure modeling using the RaptorX web server , 2012, Nature Protocols.