Machine‐learning scoring functions to improve structure‐based binding affinity prediction and virtual screening

Docking tools, which predict whether and how a small molecule binds to a target, can be applied whenever a structural model of that target is available. The reliability of docking depends, however, on the accuracy of the adopted scoring function (SF). Despite intense research over the years, improving the accuracy of SFs for structure‐based binding affinity prediction or virtual screening has proven to be a challenging task for any class of method. New SFs based on modern machine‐learning regression models, which do not impose a predetermined functional form and are thus able to exploit much larger amounts of experimental data effectively, have recently been introduced. These machine‐learning SFs have been shown to outperform a wide range of classical SFs at both binding affinity prediction and virtual screening. The emerging picture from these studies is that the classical approach of using linear regression with a small number of expert‐selected structural features can be strongly improved by a machine‐learning approach based on nonlinear regression allied with comprehensive data‐driven feature selection. Furthermore, the performance of classical SFs does not improve with larger training datasets, and hence this performance gap is expected to widen as more training data become available in the future. Other topics covered in this review include predicting the reliability of an SF on a particular target class, generating synthetic data to improve predictive performance, and modeling guidelines for SF development. WIREs Comput Mol Sci 2015, 5:405–424. doi: 10.1002/wcms.1225
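
The contrast between a fixed functional form and a model that assumes none can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the review: the data are synthetic, the single "feature" and the `true_affinity` curve are hypothetical, and a k-nearest-neighbour regressor stands in for the random forest and other nonlinear models the review discusses, simply because it needs only the Python standard library. Under these assumptions, the linear fit tends to plateau once its two parameters are well estimated, while the non-parametric model tends to keep improving as training data grow.

```python
import math
import random


def true_affinity(x):
    # Hypothetical nonlinear link between one structural feature and affinity.
    return math.sin(3.0 * x)


def make_data(n, rng):
    # Synthetic (feature, noisy affinity) pairs; not real binding data.
    xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
    ys = [true_affinity(x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b: a predetermined functional form.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b


def fit_knn(xs, ys, k=5):
    # k-nearest-neighbour regression: no functional form is assumed.
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def rmse(model, xs, ys):
    # Root-mean-square error of the model on held-out points.
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def compare(train_sizes=(20, 200, 2000), seed=0):
    # Returns {training size: (linear test RMSE, kNN test RMSE)}.
    rng = random.Random(seed)
    test_x, test_y = make_data(500, rng)
    results = {}
    for n in train_sizes:
        xs, ys = make_data(n, rng)
        results[n] = (
            rmse(fit_linear(xs, ys), test_x, test_y),
            rmse(fit_knn(xs, ys), test_x, test_y),
        )
    return results


if __name__ == "__main__":
    for n, (lin, knn) in compare().items():
        print(f"n={n:5d}  linear RMSE={lin:.3f}  kNN RMSE={knn:.3f}")
```

Running the script prints the held-out error of both models at each training-set size. The toy setting says nothing about real scoring functions; it only shows the mechanism by which a fixed linear form can stop benefiting from additional data.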
