A combined drug discovery strategy based on machine learning and molecular docking

Data mining methods based on machine learning play an increasingly important role in drug design and discovery. In this work, eight machine learning methods (decision trees, k-nearest neighbors, support vector machines, random forests, extremely randomized trees, AdaBoost, gradient boosting trees, and XGBoost) were evaluated comprehensively in a case study on acetyl-CoA carboxylase (ACC) inhibitor data sets. Internal and external data sets were used to cross-validate the eight methods. The extremely randomized trees model performed best and was adopted as the first step of virtual screening. Combined with structure-based virtual screening in the second step, this strategy obtained desirable results. The work indicates that combining machine learning methods with traditional structure-based virtual screening can improve the ability to find potential hits for a given target in large compound databases.
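
The two-step strategy described above can be sketched as a simple screening funnel: a ligand-based classifier filters the library first, and the surviving compounds are then ranked by a structure-based score. The sketch below is illustrative only and is not the authors' pipeline. The `Compound` record, the compound names, the probability cutoff, the score values, and the convention that a lower docking score is better are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Compound:
    name: str
    ml_active_prob: float  # hypothetical classifier probability of being active
    docking_score: float   # hypothetical docking score; assumed lower = better


def two_step_screen(library: List[Compound],
                    prob_cutoff: float,
                    top_n: int) -> List[Compound]:
    """Filter by the machine learning prediction, then rank by docking score."""
    # Step 1: keep only compounds the classifier predicts as likely actives.
    passed = [c for c in library if c.ml_active_prob >= prob_cutoff]
    # Step 2: rank the survivors by docking score (ascending) and keep the top N.
    ranked = sorted(passed, key=lambda c: c.docking_score)
    return ranked[:top_n]


# Made-up library; none of these values come from the paper.
library = [
    Compound("cpd_A", 0.91, -9.2),
    Compound("cpd_B", 0.35, -10.5),
    Compound("cpd_C", 0.78, -7.1),
    Compound("cpd_D", 0.66, -8.4),
    Compound("cpd_E", 0.12, -6.0),
]

hits = two_step_screen(library, prob_cutoff=0.5, top_n=2)
hit_names = [c.name for c in hits]
print(hit_names)
```

In this toy library, `cpd_B` has the best docking score but is removed in the first step because its predicted probability falls below the cutoff. Running the cheaper classifier first means that only a reduced set of compounds needs to be docked.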
