Getting the most out of PubChem for virtual screening

ABSTRACT Introduction: With the emergence of the ‘big data’ era, the biomedical research community has great interest in exploiting publicly available chemical information for drug discovery. PubChem is one such public database, providing a large amount of chemical information free of charge. Areas covered: This article provides an overview of how PubChem’s data, tools, and services can be used for virtual screening and reviews recent publications that discuss important aspects of exploiting PubChem for drug discovery. Expert opinion: PubChem offers comprehensive chemical information useful for drug discovery. It also provides multiple programmatic access routes, which are essential for building automated virtual screening pipelines that exploit PubChem data. In addition, PubChemRDF allows users to download PubChem data and load them into a local computing facility, facilitating data integration between PubChem and other resources. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug repurposing or off-target side-effect prediction). These studies demonstrate the usefulness of PubChem as a key resource for computer-aided drug discovery and related areas.
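As a rough illustration of the programmatic access mentioned in the abstract, the sketch below assembles a PUG-REST request URL for retrieving computed properties of compounds by CID. The helper name `build_property_url` is hypothetical, and the example CID and property names are illustrative choices not taken from the article. The sketch only builds the URL and makes no network call; a real pipeline would also need to fetch the result and respect PubChem's usage limits.

```python
from urllib.parse import quote

# Base address of the PUG-REST service (assumed current at time of writing).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(cids, properties, output="JSON"):
    """Build a PUG-REST URL requesting `properties` for the given CIDs.

    Hypothetical helper for illustration: it only formats the URL and
    does not contact PubChem.
    """
    if not cids or not properties:
        raise ValueError("at least one CID and one property are required")
    # CIDs are positive integers; int() rejects malformed identifiers early.
    cid_part = ",".join(str(int(c)) for c in cids)
    # Percent-encode property names so unexpected characters cannot alter the path.
    prop_part = ",".join(quote(p, safe="") for p in properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/{output}"


if __name__ == "__main__":
    # CID 2244 (aspirin) is used purely as an example input.
    print(build_property_url([2244], ["MolecularFormula", "MolecularWeight"]))
```

Keeping URL construction separate from fetching makes the request logic easy to check offline before it is wired into an automated screening workflow.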
