Frequent Subgraph Retrieval in Geometric Graph Databases

Knowledge discovery from geometric graph databases is of particular importance in chemistry and biology, because chemical compounds and proteins are represented as graphs with 3D geometric coordinates. In such applications, scientists are not interested in the statistics of the whole database. Instead, they need information about a novel drug candidate or protein at hand, represented as a query graph. We propose a polynomial-delay algorithm for geometric frequent subgraph retrieval. It enumerates all subgraphs of a single given query graph that are frequent geometric ε-subgraphs in a database under the entire class of rigid geometric transformations. By using geometric ε-subgraphs, we achieve tolerance against variations in geometry. We compare the proposed algorithm to gSpan on chemical compound data and show that, for a given minimum support, requiring geometric matching substantially limits the total number of frequent patterns. Although the computation time per pattern is larger than for non-geometric graph mining, the total time remains reasonable even for small minimum support.
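The idea of tolerance under rigid transformations can be illustrated with a small sketch. It is not the proposed algorithm, and the abstract does not spell out the definition of a geometric ε-subgraph. The sketch assumes that a match means every pattern vertex, after some rigid transformation, lies within ε of its corresponding vertex. Under that assumption, the triangle inequality implies that each pairwise distance can change by at most 2ε, which gives a cheap necessary (not sufficient) filter. The function name `passes_distance_filter` and the fixed vertex correspondence by index are hypothetical choices made for illustration.

```python
from itertools import combinations
from math import dist


def passes_distance_filter(pattern, candidate, eps):
    """Necessary condition for an eps-match under a rigid transformation.

    Assumption (not taken from the paper): a match places every pattern
    vertex within eps of its counterpart, with vertices paired by index.
    Pairwise distances are invariant under rigid motions, so each one can
    then differ by at most 2 * eps. Reflections also pass this test, so a
    True result is only a candidate, never a confirmed match.
    """
    if len(pattern) != len(candidate):
        return False
    return all(
        abs(dist(pattern[i], pattern[j]) - dist(candidate[i], candidate[j]))
        <= 2 * eps
        for i, j in combinations(range(len(pattern)), 2)
    )


# A right-angled triangle, a rotated and translated copy, and a stretched copy.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = [(5.0, 5.0, 5.0), (5.0, 6.0, 5.0), (4.0, 5.0, 5.0)]
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

moved_ok = passes_distance_filter(triangle, moved, eps=0.1)
stretched_ok = passes_distance_filter(triangle, stretched, eps=0.1)
```

Here `moved_ok` is true because the copy differs from the pattern only by a rotation and a translation, while `stretched_ok` is false because one edge is twice as long. A full matcher would still have to find the vertex correspondence and verify an actual alignment.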
