Graph indexing: a frequent structure-based approach

Graph has become increasingly important in modelling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. In this paper, we investigate the issues of indexing graphs and propose a novel solution by applying a graph mining technique. Different from the existing path-based methods, our approach, called gIndex, makes use of frequent substructure as the basic indexing feature. Frequent substructures are ideal candidates since they explore the intrinsic characteristics of the data and are relatively stable to database updates. To reduce the size of index structure, two techniques, size-increasing support constraint and discriminative fragments, are introduced. Our performance study shows that gIndex has 10 times smaller index size, but achieves 3--10 times better performance in comparison with a typical path-based method, GraphGrep. The gIndex approach not only provides and elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit form data mining, especially frequent pattern mining. Furthermore, the concepts developed here can be applied to indexing sequences, trees, and other complicated structures as well.

[1]  Takashi Washio,et al.  An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data , 2000, PKDD.

[2]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[3]  Kyuseok Shim,et al.  APEX: an adaptive path index for XML data , 2002, SIGMOD '02.

[4]  Dan Suciu,et al.  Index Structures for Path Expressions , 1999, ICDT.

[5]  Roy Goldman,et al.  DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases , 1997, VLDB.

[6]  Ehud Gudes,et al.  Exploiting local similarity for indexing paths in graph-structured data , 2002, Proceedings 18th International Conference on Data Engineering.

[7]  Mohammed J. Zaki,et al.  Fast vertical mining using diffsets , 2003, KDD '03.

[8]  Michael J. Franklin,et al.  A Fast Index for Semistructured Data , 2001, VLDB.

[9]  Srinath Srinivasa,et al.  A Platform Based on the Multi-dimensional Data Model for Analysis of Bio-Molecular Structures , 2003, VLDB.

[10]  Jiawei Han,et al.  CloseGraph: mining closed frequent graph patterns , 2003, KDD '03.

[11]  Christian Borgelt,et al.  Mining molecular fragments: finding relevant substructures of molecules , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[12]  I. Rigoutsos,et al.  Geometric Hashing , 1997, IEEE Computational Science and Engineering.

[13]  Andrew Lim,et al.  D(k)-index: an adaptive structural summary for graph-structured data , 2003, SIGMOD '03.

[14]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[15]  Ali Shokoufandeh,et al.  Indexing using a spectral encoding of topological structure , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[16]  Euripides G. M. Petrakis,et al.  Similarity Searching in Medical Image Databases , 1997, IEEE Trans. Knowl. Data Eng..

[17]  Alberto Del Bimbo,et al.  Efficient Matching and Indexing of Graph Models in Content-Based Retrieval , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Dennis Shasha,et al.  Algorithmics and applications of tree and graph searching , 2002, PODS.

[19]  S. Bryant,et al.  Threading a database of protein cores , 1995, Proteins.

[20]  Ehud Gudes,et al.  Computing frequent graph patterns from semistructured data , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[21]  Takashi Washio,et al.  State of the art of graph-based data mining , 2003, SKDD.