Mining Chemical Activity Status from High-Throughput Screening Assays

High-throughput screening (HTS) experiments provide a valuable resource that reports the biological activity of numerous chemical compounds relative to their molecular targets. Building computational models that accurately predict such activity status (active vs. inactive) in specific assays is a challenging task, given the large volume of data and the frequently small proportion of active compounds relative to inactive ones. We developed a method, DRAMOTE, to predict the activity status of chemical compounds in HTS activity assays. For a class of HTS assays, our method achieves considerably better results than the current state-of-the-art solutions. We achieved this by modifying a minority oversampling technique. To demonstrate that DRAMOTE performs better than the other methods, we carried out a comprehensive comparison with several other methods and evaluated them on data from 11 PubChem assays through 1,350 experiments that involved approximately 500,000 interactions between chemicals and their target proteins. As an example of potential use, we applied DRAMOTE to develop robust models for predicting FDA-approved drugs that have a high probability of interacting with the thyroid stimulating hormone receptor (TSHR) in humans. Our findings are further partially and indirectly supported by 3D docking results and literature information. The results based on approximately 500,000 interactions suggest that DRAMOTE performed the best among the compared methods and that it can be used for developing robust virtual screening models. The datasets and implementations of all solutions are available online as a MATLAB toolbox at www.cbrc.kaust.edu.sa/dramote and can also be found on Figshare.
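For readers unfamiliar with minority oversampling, the following is a minimal Python sketch of the generic SMOTE-style idea: new minority-class (active) samples are synthesized by interpolating between an existing minority sample and one of its nearest minority neighbours. It is not the DRAMOTE algorithm, whose modification is not described in this abstract; the function name `smote_like_oversample`, its parameters, and the toy data are assumptions made purely for illustration.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points built by generic SMOTE-style interpolation.

    `minority` is a list of feature vectors (lists of floats) for the rare
    class and must hold at least two samples. Illustrative sketch only; this
    is NOT the DRAMOTE method.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four "active" compounds described by two features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=6, k=2)
```

In a typical workflow the synthetic points would be appended to the training set before fitting a classifier, so that active compounds are less heavily outnumbered by inactive ones.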
