Substructure Mining Using Elaborate Chemical Representation

Substructure mining algorithms are important drug discovery tools because they can identify substructures that affect physicochemical and biological properties. Current methods, however, consider only part of the chemical information present in a data set of compounds. The overall aim of our study was therefore to enable more exhaustive data mining by designing methods that detect all substructures of any size, shape, and level of chemical detail. We developed a chemical representation based on atomic hierarchies, which allows substructure mining to consider general features, highly specific features, or both. As a proof of concept, the efficient, multipurpose graph mining system Gaston learned substructures of any size and shape from a mutagenicity data set represented in this manner. From these substructures, we extracted a set of only six nonredundant, discriminative substructures that represent relevant biochemical knowledge. Our results demonstrate the individual and synergistic importance of elaborate chemical representation and of mining for nonlinear substructures. We conclude that combining elaborate chemical representation with Gaston provides an excellent method for 2D substructure mining, because this approach systematically explores all substructures at different levels of chemical detail.
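The idea of an atomic hierarchy can be illustrated with a minimal sketch. The hierarchy below (the `PARENT` table, the label names, and the toy molecules) is invented for illustration and is not the representation used in the study; the code also works only on atom labels and does not reproduce Gaston or any subgraph search. It shows only how a pattern label can match an atom either specifically or through a more general ancestor, and how the number of matching molecules grows as the pattern becomes more general.

```python
# Hypothetical atom hierarchy: each label points to its more general parent.
# These labels are assumptions made for this sketch, not the study's scheme.
PARENT = {
    "F": "halogen",
    "Cl": "halogen",
    "Br": "halogen",
    "halogen": "heteroatom",
    "N": "heteroatom",
    "O": "heteroatom",
    "heteroatom": "any",
    "C": "any",
}


def label_chain(label):
    """Return the label followed by all of its ancestors, most specific first."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(pattern_label, atom_label):
    """A pattern label matches an atom if it is the atom's label or an ancestor."""
    return pattern_label in label_chain(atom_label)


def support(pattern_label, molecules):
    """Count molecules that contain at least one atom matching the pattern label."""
    return sum(
        any(matches(pattern_label, atom) for atom in atoms)
        for atoms in molecules
    )


# Toy data set: each molecule is reduced to a list of atom labels.
molecules = [
    ["C", "C", "Cl"],
    ["C", "N", "O"],
    ["C", "C", "Br"],
    ["C", "C", "C"],
]

for pattern in ["Cl", "halogen", "heteroatom", "any"]:
    print(pattern, support(pattern, molecules))
```

In a full graph miner, the same matching rule would apply to every node of a candidate substructure, so a single search could return both general and highly specific patterns.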
