Development and rigorous validation of antimalarial predictive models using machine learning approaches

ABSTRACT A large collection of experimentally verified compounds from the ChEMBL database was used to build classification models for predicting antimalarial activity against Plasmodium falciparum. Four machine learning methods, namely support vector machine (SVM), random forest (RF), k-nearest neighbour (kNN) and XGBoost, were used to develop the models on this diverse antimalarial dataset. A well-established feature selection framework was used to select the best subset from a larger pool of descriptors. Model performance was rigorously evaluated through applicability domain analysis, Y-scrambling and the AUC-ROC curve. The predictive power of the models was additionally assessed using probability calibration and predictiveness curves. SVM and XGBoost performed best, yielding an accuracy of ~85% on the independent test set. In terms of probability prediction, SVM and XGBoost were well calibrated. The total gain (TG) from the predictiveness curve also favoured SVM (TG = 0.67) and XGBoost (TG = 0.75). These models also assigned high probability scores to the high-affinity compounds from a PubChem antimalarial bioassay used as external validation. Our findings suggest that the selected models are robust and potentially useful for facilitating the discovery of antimalarial agents.
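
As a rough illustration of the total gain statistic mentioned in the abstract, the sketch below computes an empirical TG from a list of predicted probabilities. The predictiveness curve plots the predicted risk R(v) against its quantile v, and TG is the area between that curve and the horizontal line at the prevalence p; dividing by the maximum 2p(1 - p) gives a standardized value between 0 and 1. The function names, the use of the mean predicted probability as a stand-in for the prevalence, and the example values are assumptions made for illustration; the abstract does not say which TG variant or implementation the authors used.

```python
def predictiveness_curve(probabilities):
    """Return (quantile, risk) points of the empirical predictiveness curve."""
    risks = sorted(probabilities)
    n = len(risks)
    return [((i + 1) / n, r) for i, r in enumerate(risks)]


def total_gain(probabilities, prevalence=None):
    """Return (TG, standardized TG) for predicted probabilities of activity.

    Assumption: when no prevalence is supplied, the mean predicted
    probability is used in its place (reasonable for a calibrated model).
    """
    n = len(probabilities)
    if n == 0:
        raise ValueError("at least one predicted probability is required")
    p = sum(probabilities) / n if prevalence is None else prevalence
    # Empirical version of the integral of |R(v) - p| over v in [0, 1].
    tg = sum(abs(r - p) for r in probabilities) / n
    max_tg = 2 * p * (1 - p)  # TG of a perfectly discriminating model
    standardized = tg / max_tg if max_tg > 0 else 0.0
    return tg, standardized


# Hypothetical scores: a perfectly separating model and an uninformative one.
perfect = total_gain([0.0, 0.0, 1.0, 1.0])
uninformative = total_gain([0.5, 0.5, 0.5, 0.5])
print(perfect, uninformative)
```

A standardized TG near 1 indicates that the predicted risks separate sharply from the prevalence, while a value near 0 indicates a model that assigns nearly the same probability to every compound.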
