Acyclic Subgraph Based Descriptor Spaces for Chemical Compound Retrieval and Classification

Abstract: In recent years, the development of computational techniques that build models to correctly assign chemical compounds to various classes, or to retrieve potential drug-like compounds, has been an active area of research. These techniques are used extensively at various phases of the drug development process. Many of the best-performing techniques for these tasks use a descriptor-based representation of the compound that captures various aspects of the underlying molecular graph's topology. In this paper we introduce and describe algorithms for efficiently generating a new set of descriptors that are derived from all connected acyclic fragments present in the molecular graphs. In addition, we introduce an extension to existing vector-based kernel functions that takes into account the length of the fragments present in the descriptors. We experimentally evaluate the performance of the new descriptors in the context of SVM-based classification and ranked retrieval on 28 classification and retrieval problems derived from 17 datasets. Our experiments show that, for both the classification and retrieval tasks, these new descriptors consistently and statistically significantly outperform previously developed schemes based on the widely used fingerprint- and MACCS keys-based descriptors, as well as recently introduced descriptors obtained by mining and analyzing the structure of the molecular graphs.
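To make the notion of "connected acyclic fragments" concrete, the following is a minimal brute-force sketch, not the efficient generation algorithm the paper describes. It assumes a molecular graph given as a plain list of undirected bonds between integer atom indices; the names `is_tree`, `acyclic_fragments`, and `max_edges` are hypothetical, and atom/bond labels and fragment canonicalization are deliberately left out.

```python
from itertools import combinations


def is_tree(edges):
    """Return True if the edge set is connected and contains no cycle."""
    nodes = {v for e in edges for v in e}
    if len(edges) != len(nodes) - 1:
        return False  # a tree on n nodes has exactly n - 1 edges

    # Union-find: joining two nodes that already share a root closes a cycle.
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
    # n - 1 edges and no cycle implies the fragment is connected.
    return True


def acyclic_fragments(edges, max_edges):
    """Enumerate connected acyclic subgraphs with 1..max_edges bonds.

    Brute force over edge subsets, so only suitable for tiny graphs
    (assumption: used here purely for illustration).
    """
    fragments = []
    for k in range(1, max_edges + 1):
        for subset in combinations(edges, k):
            if is_tree(subset):
                fragments.append(frozenset(subset))
    return fragments


# Toy graph: a three-membered ring (atoms 0, 1, 2) with one substituent (atom 3).
bonds = [(0, 1), (1, 2), (0, 2), (2, 3)]
fragments = acyclic_fragments(bonds, max_edges=3)

# Group fragments by length (number of bonds), the quantity a
# length-aware kernel would need to distinguish.
by_length = {}
for frag in fragments:
    by_length.setdefault(len(frag), []).append(frag)
```

In this toy graph the ring itself is excluded because it is cyclic, while paths and the star centered on atom 2 are kept. A real descriptor space would additionally map each fragment to a canonical labeled form so that identical fragments from different molecules share one dimension.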
