Comparison of descriptor spaces for chemical compound retrieval and classification

In recent years, the development of computational techniques that build models to correctly assign chemical compounds to various classes, or to retrieve potential drug-like compounds, has been an active area of research. Many of the best-performing techniques for these tasks use a descriptor-based representation of the compound that captures various aspects of the underlying molecular graph’s topology. In this paper we compare five different sets of descriptors that are currently used for chemical compound classification. We also introduce four different descriptors derived from all connected fragments present in the molecular graphs. Their primary purpose is to allow comparison with the currently used descriptor spaces and to analyze which properties of a descriptor space help it represent molecular graphs effectively. In addition, we introduce an extension to existing vector-based kernel functions that takes into account the length of the fragments present in the descriptors. We experimentally evaluate the performance of the previously introduced and the new descriptors in the context of SVM-based classification and ranked retrieval on 28 classification and retrieval problems derived from 18 datasets. Our experiments show that, for both of these tasks, two of the four descriptors introduced in this paper, along with the extended connectivity fingerprint-based descriptors, consistently and statistically significantly outperform previously developed schemes based on the widely used fingerprint- and MACCS keys-based descriptors, as well as recently introduced descriptors obtained by mining and analyzing the structure of the molecular graphs.
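The abstract does not spell out how the kernel extension uses fragment length, so the following is only an illustrative sketch of one plausible reading: compute a vector-based kernel separately for each fragment length and average the results. The descriptor layout (a dict keyed by `(length, fragment)` pairs), the min-max form of the Tanimoto similarity, and the function names `tanimoto`, `split_by_length`, and `length_aware_tanimoto` are all assumptions made for this example, not the paper's definitions.

```python
from collections import defaultdict


def tanimoto(x, y):
    """Min-max Tanimoto similarity between two sparse count vectors (dicts)."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den else 0.0


def split_by_length(desc):
    """Group descriptor entries by fragment length.

    Assumption: each key is a (length, fragment_id) tuple.
    """
    groups = defaultdict(dict)
    for (length, frag), count in desc.items():
        groups[length][(length, frag)] = count
    return groups


def length_aware_tanimoto(x, y):
    """Average the per-length Tanimoto values over all lengths seen in x or y.

    Hypothetical combination rule: an unweighted mean, so that the many
    long fragments do not dominate the few short ones.
    """
    gx, gy = split_by_length(x), split_by_length(y)
    lengths = set(gx) | set(gy)
    if not lengths:
        return 0.0
    total = sum(tanimoto(gx.get(l, {}), gy.get(l, {})) for l in lengths)
    return total / len(lengths)


# Toy descriptors: fragment length in bonds, fragment label, occurrence count.
a = {(1, "C-C"): 2, (1, "C-O"): 1, (2, "C-C-O"): 1}
b = {(1, "C-C"): 1, (2, "C-C-O"): 1, (2, "C-C-N"): 1}

pooled = tanimoto(a, b)                    # all fragments in one vector
by_length = length_aware_tanimoto(a, b)    # one kernel per length, averaged
```

In this sketch the pooled kernel treats every fragment alike, while the length-aware variant gives each fragment length an equal say; other weightings per length would fit the same structure.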
