Computational models for in-vitro anti-tubercular activity of molecules based on high-throughput chemical biology screening datasets

Background: The emergence of multidrug-resistant tuberculosis in pandemic proportions throughout the world, together with the paucity of novel therapeutics for tuberculosis, has reiterated the need to accelerate the discovery of novel molecules with anti-tubercular activity. Though high-throughput screens for anti-tubercular activity are available, they are too expensive, tedious and time-consuming to be performed on a large scale. Thus, there remains an unmet need to prioritize the molecules that are taken up for biological screens, to save on cost and time. Computational methods, including Machine Learning, have been widely employed to build classifiers for high-throughput virtual screens that prioritize molecules for further analysis. The availability of datasets from high-throughput biological screens or assays in the public domain makes computational methods a plausible proposition for building predictive models. In addition, this approach would significantly reduce the cost, effort and time required to run high-throughput screens.

Results: We show that by using four supervised state-of-the-art classifiers (SMO, Random Forest, Naive Bayes and J48) we are able to generate in-silico predictive models on an extremely imbalanced (minority class ratio: 0.6%) large dataset of anti-tubercular molecules with reasonable area under the ROC curve (AROC, 0.6-0.75) and BCR (60-66%) values. Moreover, these models provide 3-4 fold enrichment over random selection.

Conclusions: In the present study, we have used publicly available data from a high-throughput in-vitro screen for anti-tubercular activity to build highly accurate classifiers based on the molecular descriptors of the molecules. We show that Machine Learning tools can be used to build highly effective predictive models for virtual high-throughput screens that prioritize molecules from large molecular libraries.
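The two summary figures quoted in the results, BCR and fold enrichment, can be illustrated with a short calculation from a confusion matrix. The sketch below is not the authors' code. It assumes that BCR stands for balanced classification rate, taken here as the mean of sensitivity and specificity, and that enrichment is the hit rate among molecules predicted active divided by the hit rate of the whole library; the abstract defines neither term. The function names and the counts are hypothetical and chosen only for illustration.

```python
def balanced_classification_rate(tp, fn, tn, fp):
    """Mean of sensitivity and specificity (assumed definition of BCR)."""
    sensitivity = tp / (tp + fn)  # fraction of true actives recovered
    specificity = tn / (tn + fp)  # fraction of true inactives rejected
    return 0.5 * (sensitivity + specificity)


def enrichment_factor(tp, fn, tn, fp):
    """Hit rate among predicted actives relative to the library hit rate
    (assumed definition of enrichment over random selection)."""
    total = tp + fn + tn + fp
    hit_rate_selected = tp / (tp + fp)    # actives among molecules the model picks
    hit_rate_library = (tp + fn) / total  # actives in the whole library
    return hit_rate_selected / hit_rate_library


# Hypothetical counts for an imbalanced screen (not taken from the paper).
tp, fn, tn, fp = 60, 40, 8900, 1000

bcr = balanced_classification_rate(tp, fn, tn, fp)
ef = enrichment_factor(tp, fn, tn, fp)
print(f"BCR = {bcr:.3f}, enrichment = {ef:.2f}-fold")
```

On a heavily imbalanced dataset, plain accuracy is dominated by the inactive class, which is why a balanced measure and an enrichment figure are more informative for judging how well a model prioritizes molecules.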
