Finding Frequent Substructures in Chemical Compounds

The discovery of relationships between chemical structure and biological function is central to biological science and medicine. In this paper we apply data mining to the problem of predicting chemical carcinogenicity, a toxicology application that was launched at IJCAI'97 as a research challenge for artificial intelligence. Our approach is descriptive rather than classification-based: the goal is to find common substructures and properties in chemical compounds and, in this way, to contribute to scientific insight. This contrasts with previous machine learning research on the problem, which has mainly concentrated on predicting the toxicity of unknown chemicals. Our contribution to data mining is the ability to discover useful frequent patterns that are beyond the complexity of association rules and their known variants. This is vital here, because the problem requires patterns that cannot be obtained by simple transformations to frequent itemsets. We present a knowledge discovery method for structured data in which patterns reflect the one-to-many and many-to-many relationships among several tables. Background knowledge, represented in a uniform manner in some of the tables, plays an essential role, unlike in most data mining settings for the discovery of frequent patterns.
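
To make the idea of a frequent pattern spanning several tables concrete, the following minimal sketch counts how many compounds contain a bond between two given element types. It is an illustration only, not the method of the paper: the `atoms` and `bonds` tables, their contents, and the function `pattern_frequency` are all hypothetical. The point it shows is that frequency is counted per compound, while the pattern itself is matched by joining two tables, which is why a flat itemset encoding is not enough.

```python
# Hypothetical toy data: two "tables" linked by compound id (illustrative only).
atoms = [
    # (compound, atom_id, element)
    ("c1", "a1", "c"), ("c1", "a2", "n"), ("c1", "a3", "o"),
    ("c2", "a1", "c"), ("c2", "a2", "n"),
    ("c3", "a1", "c"), ("c3", "a2", "o"),
]
bonds = [
    # (compound, atom_id_1, atom_id_2)
    ("c1", "a1", "a2"), ("c1", "a2", "a3"),
    ("c2", "a1", "a2"),
    ("c3", "a1", "a2"),
]


def pattern_frequency(elem_a, elem_b):
    """Fraction of compounds with a bond joining an elem_a atom and an elem_b atom."""
    # Index the atom table so each bond endpoint can be joined to its element.
    element = {(c, a): e for c, a, e in atoms}
    compounds = {c for c, _, _ in atoms}
    matching = set()
    for c, x, y in bonds:
        # Bonds are treated as undirected, so compare unordered element pairs.
        if {element[(c, x)], element[(c, y)]} == {elem_a, elem_b}:
            matching.add(c)  # a compound counts once, however many bonds match
    return len(matching) / len(compounds)


if __name__ == "__main__":
    for pair in [("c", "n"), ("c", "o"), ("n", "o")]:
        print(pair, pattern_frequency(*pair))
```

A pattern would be called frequent if this fraction meets a user-chosen threshold; longer patterns (chains of bonds, or bonds combined with background-knowledge tables) would extend the same join-then-count scheme.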
