Frequent substructure-based approaches for classifying chemical compounds

We study the problem of classifying chemical compound datasets. We present a substructure-based classification algorithm that decouples substructure discovery from classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the dataset. The advantage of this approach is that all relevant substructures are available during model construction, allowing the classifier to select the most discriminating ones. Computational scalability is ensured by highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Our experimental evaluation on eight different classification problems shows that the approach is computationally scalable and, on average, outperforms existing schemes by 10% to 35%.
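The decoupled pipeline described above (discover frequent substructures first, then select discriminating ones, then build feature vectors for a classifier) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: each compound is reduced to a set of labeled edges (atom, bond, atom) as a stand-in for mined subgraphs, the toy data and the names `frequent_features`, `select_discriminating`, and `to_vector` are hypothetical, and the selection score (absolute difference in per-class support) is a simple placeholder for whatever feature selection the paper actually uses.

```python
from collections import Counter

# Hypothetical toy data: (set of labeled edges, class label).
# Labeled edges stand in for the topological substructures that a
# real frequent subgraph miner would discover.
compounds = [
    ({("C", "-", "C"), ("C", "=", "O"), ("C", "-", "N")}, 1),
    ({("C", "-", "C"), ("C", "=", "O")}, 1),
    ({("C", "-", "C"), ("C", "-", "Cl")}, 0),
    ({("C", "-", "C"), ("C", "-", "Cl"), ("C", "-", "N")}, 0),
]


def frequent_features(data, min_support):
    """Step 1: keep features present in at least min_support of the compounds."""
    counts = Counter(f for feats, _ in data for f in feats)
    threshold = min_support * len(data)
    return sorted(f for f, c in counts.items() if c >= threshold)


def select_discriminating(data, features, top_k):
    """Step 2: rank features by the gap between their per-class supports.

    Assumes both classes are non-empty.
    """
    pos = [feats for feats, y in data if y == 1]
    neg = [feats for feats, y in data if y == 0]

    def score(f):
        p = sum(f in feats for feats in pos) / len(pos)
        n = sum(f in feats for feats in neg) / len(neg)
        return abs(p - n)

    # Ties are broken by the feature itself so the result is deterministic.
    return sorted(features, key=lambda f: (-score(f), f))[:top_k]


def to_vector(feats, selected):
    """Step 3: binary vector marking which selected features a compound has."""
    return [1 if f in feats else 0 for f in selected]


frequent = frequent_features(compounds, min_support=0.5)
selected = select_discriminating(compounds, frequent, top_k=2)
vectors = [to_vector(feats, selected) for feats, _ in compounds]
```

In this toy setting the C-C edge is frequent but appears in every compound, so it carries no class information and is dropped at the selection step. The resulting binary vectors could then be passed to any standard classifier; which classifier to use is left open here.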
