Fusing Dual-Event Data Sets for Mycobacterium tuberculosis Machine Learning Models and Their Evaluation

The search for new tuberculosis treatments continues because we need molecules that act more quickly, can be accommodated in multidrug regimens, and overcome ever-increasing levels of drug resistance. Multiple large-scale phenotypic high-throughput screens against Mycobacterium tuberculosis (Mtb) have generated dose-response data, enabling the generation of machine learning models. These models also incorporated cytotoxicity data and were recently validated with a large external data set. A cheminformatics data-fusion approach followed by Bayesian machine learning, Support Vector Machine, or Recursive Partitioning (RP) model development (based on publicly available Mtb screening data) was used to compare models built on individual data sets with those built on the combined data. A set of 1924 commercially available molecules with promising antitubercular activity (and lack of relative cytotoxicity to Vero cells) was used to evaluate the predictive nature of the models. We demonstrate that combining three data sets incorporating antitubercular and Vero cell cytotoxicity data from our previous screens results in an external validation receiver operator characteristic (ROC) value of 0.83 (Bayesian or RP Forest). Models that do not have the highest 5-fold cross-validation ROC scores can outperform other models in a test-set-dependent manner. Using predictions for a recently published set of Mtb leads from GlaxoSmithKline, we demonstrate that no single machine learning model may be enough to identify compounds of interest. Data set fusion represents a further useful strategy for machine learning model construction, as illustrated with Mtb. Coverage of chemistry and Mtb target spaces may also be limiting factors for the whole-cell screening data generated to date.
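To make the dual-event and data-fusion ideas concrete, the following is a minimal sketch rather than the procedure used in this work. It assumes that a compound is labelled active only when it is both potent against Mtb and selective over Vero cells, that data sets are fused by merging records on a shared compound identifier, and that models are scored with a rank-based ROC value. The `Record` fields, the cutoff values, the first-occurrence rule for duplicates, and all compound data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    compound_id: str     # hypothetical key, e.g. a canonical structure string
    mtb_ic90_um: float   # whole-cell Mtb potency (must be > 0)
    vero_cc50_um: float  # Vero cell cytotoxicity


def dual_event_label(rec, ic90_cutoff=10.0, si_cutoff=10.0):
    """Active only if potent against Mtb AND selective over Vero cells.

    Both cutoffs are illustrative assumptions, not values from the abstract.
    """
    selectivity_index = rec.vero_cc50_um / rec.mtb_ic90_um
    return rec.mtb_ic90_um <= ic90_cutoff and selectivity_index >= si_cutoff


def fuse(*datasets):
    """Merge several data sets into one labelled set keyed by compound_id.

    When a compound appears more than once, the first occurrence wins
    (an assumption made here to keep the sketch simple).
    """
    fused = {}
    for dataset in datasets:
        for rec in dataset:
            fused.setdefault(rec.compound_id, dual_event_label(rec))
    return fused


def roc_auc(scores, labels):
    """Rank-based ROC value: chance that a random active outscores a
    random inactive, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three small made-up data sets standing in for separate screens.
set_a = [Record("c1", 1.0, 50.0), Record("c2", 5.0, 20.0)]
set_b = [Record("c3", 20.0, 400.0), Record("c1", 2.0, 10.0)]
set_c = [Record("c4", 0.5, 100.0)]

fused = fuse(set_a, set_b, set_c)

# Made-up model scores for c1..c4, compared against the fused labels.
model_scores = [0.9, 0.4, 0.8, 0.7]
auc = roc_auc(model_scores, [fused[c] for c in ("c1", "c2", "c3", "c4")])
```

In this toy example `c2` is potent but not selective and `c3` is selective but not potent, so neither is labelled active. A real workflow would compute molecular descriptors for each compound and train a classifier on the fused labels before scoring an external test set.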
