Target prediction utilising negative bioactivity data covering large chemical space

Background: In silico analyses are increasingly used to support mode-of-action investigations; however, many such approaches do not utilise the large amounts of inactive data held in chemogenomic repositories. This work integrates such bioactivity data into the target prediction of orphan compounds, producing probabilities of activity and inactivity for a range of targets. To this end, a novel human bioactivity data set was constructed by assimilating over 195 million bioactivity data points deposited in the ChEMBL and PubChem repositories and then applying a sphere-exclusion selection algorithm to oversample presumed inactive compounds.

Results: A Bernoulli Naïve Bayes algorithm was trained on the data and evaluated using fivefold cross-validation, achieving a mean recall and precision of 67.7 % and 63.8 % for active compounds and 99.6 % and 99.7 % for inactive compounds, respectively. We show that model performance is considerably influenced by the underlying intraclass training similarity, the size of a given class of compounds, and the degree of additional oversampling. The method was also validated using compounds extracted from WOMBAT, producing average precision-recall AUC and BEDROC scores of 0.56 and 0.85, respectively. The inactive data points used for this test are based on presumed inactivity, so these scores give only an approximate indication of the true extrapolative ability of the models. A distance-based applicability domain analysis was also conducted, indicating that an average Tanimoto coefficient distance of 0.3 or greater between a test and training set can be used to give a global measure of confidence in model predictions. In a final comparison, a method trained solely on active data from ChEMBL achieved precision-recall AUC and BEDROC scores of 0.45 and 0.76, respectively.

Conclusions: The inclusion of inactive data for model training produces models with superior AUC and improved early recognition capabilities, although internal and external validation show that performance differs across the breadth of models. The realised target prediction protocol is available at https://github.com/lhm30/PIDGIN.

Graphical abstract: The inclusion of large-scale negative training data for in silico target prediction improves the precision-recall AUC and BEDROC scores for target models.
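The abstract names two distance-based steps without giving their details: sphere-exclusion selection of presumed inactives, and a Tanimoto-distance applicability domain check. The following is a minimal sketch of both, not the PIDGIN implementation. It assumes fingerprints are represented as sets of on-bit indices, that sphere exclusion is done greedily in input order, and that the "average distance between a test and training set" is the mean nearest-neighbour distance. The function names, the toy fingerprints, and the radius of 0.7 are all hypothetical.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a binary fingerprint is stored as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - Tanimoto coefficient for two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here: two empty fingerprints are identical
    return 1.0 - len(a & b) / union


def sphere_exclusion(pool: Sequence[Fingerprint], radius: float) -> List[int]:
    """Greedy sphere exclusion (illustrative variant): keep a compound only if
    it lies at least `radius` away from every compound already kept.
    Returns the indices of the kept compounds."""
    kept: List[int] = []
    for i, fp in enumerate(pool):
        if all(tanimoto_distance(fp, pool[j]) >= radius for j in kept):
            kept.append(i)
    return kept


def mean_nearest_distance(test: Sequence[Fingerprint],
                          train: Sequence[Fingerprint]) -> float:
    """Average, over test compounds, of the distance to the nearest
    training compound (one possible reading of the average distance
    between a test set and a training set)."""
    nearest = [min(tanimoto_distance(t, r) for r in train) for t in test]
    return sum(nearest) / len(nearest)


# Toy data, hypothetical: five candidate compounds and two test compounds.
pool = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),   # close to the first compound, so excluded
    frozenset({7, 8, 9}),
    frozenset({7, 8, 10}),     # close to the third compound, so excluded
    frozenset({20, 21}),
]
selected = sphere_exclusion(pool, radius=0.7)  # radius is an arbitrary choice
train = [pool[i] for i in selected]
test = [frozenset({1, 2, 3}), frozenset({30, 31})]

print(selected)                            # indices of the diverse subset
print(mean_nearest_distance(test, train))  # applicability-domain distance
```

A larger radius keeps fewer, more diverse compounds, while a larger mean nearest-neighbour distance indicates that the test compounds lie further from the chemistry the models were trained on.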
