Data mining PubChem using a support vector machine with the Signature molecular descriptor: classification of factor XIa inhibitors.

The PubChem project (http://pubchem.ncbi.nlm.nih.gov/) has significantly increased the amount of readily available high-throughput screening (HTS) data. Together with cheminformatic tools, the resources available through PubChem offer considerable opportunity for data mining of small molecules across a variety of biological systems. In this work, we trained a support vector machine (SVM) classifier on factor XIa inhibitor HTS data using the Signature molecular descriptor. The optimal number of Signatures was selected with a feature selection algorithm that operates on clusters of highly correlated features. Whereas previous methods scored clusters individually, our method allows clusters to work together to improve accuracy. The resulting model had a 10-fold cross-validation accuracy of 89%, and two independent test sets provided additional validation. We applied the SVM to rapidly predict activity for approximately 12 million compounds also deposited in PubChem. Confidence in these predictions was assessed with an overlap metric, which considers the number of a compound's Signatures that fall within the training set range. To further evaluate compounds identified as active by the SVM, docking studies were performed using AutoDock. The result is a focused database of compounds predicted to be active, several of which are appreciably dissimilar to those used to train the SVM. This focused database is suitable for further study. The data mining technique presented here is not specific to factor XIa inhibitors and could be applied to other bioassays in PubChem where the goal is to expand the search for small molecules as chemical probes.
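The abstract describes the overlap metric only in words, so the following Python sketch shows one possible reading of it rather than the authors' implementation. It assumes each compound is represented as a mapping from Signature string to occurrence count, that the "training set range" of a Signature is the minimum and maximum count observed across training compounds, and that the score is reported as a fraction of the compound's Signatures. The function names and the Signature strings are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

SignatureCounts = Dict[str, int]  # assumed representation: Signature -> occurrence count


def training_ranges(training_set: List[SignatureCounts]) -> Dict[str, Tuple[int, int]]:
    """Return the (min, max) occurrence count seen in training for each Signature.

    A Signature absent from a training compound counts as 0 for that compound.
    """
    all_signatures = set()
    for compound in training_set:
        all_signatures.update(compound)
    ranges = {}
    for sig in all_signatures:
        counts = [compound.get(sig, 0) for compound in training_set]
        ranges[sig] = (min(counts), max(counts))
    return ranges


def overlap(compound: SignatureCounts, ranges: Dict[str, Tuple[int, int]]) -> float:
    """Fraction of the compound's Signatures whose counts lie within the training range.

    Signatures never seen in training are treated as outside the range
    (an assumption of this sketch).
    """
    if not compound:
        return 0.0
    inside = sum(
        1
        for sig, count in compound.items()
        if sig in ranges and ranges[sig][0] <= count <= ranges[sig][1]
    )
    return inside / len(compound)


# Hypothetical toy data; the keys stand in for real Signature strings.
training = [{"sig_A": 2, "sig_B": 1}, {"sig_A": 4}]
ranges = training_ranges(training)
query = {"sig_A": 3, "sig_B": 2, "sig_C": 1}
score = overlap(query, ranges)
```

Under this reading, a low score flags a compound whose Signatures mostly fall outside what the classifier saw during training, so its predicted activity would be treated with less confidence.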
