Efficient query processing on graph databases

We study the problem of processing subgraph queries on a database that consists of a set of graphs. The answer to a subgraph query is the set of graphs in the database that are supergraphs of the query. In this article, we propose an efficient index, FG*-index, to solve this problem. The cost of processing a subgraph query using most existing indexes consists mainly of two parts: the index probing cost and the candidate verification cost. Index probing finds the query in the index, or finds the graphs from which a candidate answer set for the query can be generated. Candidate verification tests whether each graph in the candidate set is indeed a supergraph of the query. We design FG*-index to minimize these two costs as follows. FG*-index consists of three components: the FG-index, the feature-index, and the FAQ-index. First, the FG-index employs the concept of Frequent subGraph (FG) to allow the set of queries that are FGs to be answered without candidate verification. We call this set of queries FG-queries. We can enlarge the set of FG-queries so that more queries can be answered without candidate verification; however, a larger set of FG-queries implies a larger FG-index, and hence a higher index probing cost. We propose the feature-index to reduce the index probing cost. The feature-index uses features to filter out false results that are matched in the FG-index, so that we can quickly find the truly matching graphs for a query. For processing non-FG-queries, we propose the FAQ-index, which is dynamically constructed from the set of Frequently Asked non-FG-Queries (FAQs). Using the FAQ-index, no verification is required for processing FAQs, and only a small number of candidates need to be verified for processing non-FG-queries that are not frequently asked. Finally, a comprehensive set of experiments verifies that query processing using FG*-index is up to orders of magnitude more efficient than state-of-the-art indexes, and that it is also more scalable.
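
The two costs described above can be illustrated with a toy sketch. It is not the FG*-index itself: all function names (make_graph, is_subgraph, canonical_key, edge_features, build_fg_index, answer_query) are hypothetical, the candidate patterns are supplied by hand rather than mined, the edge-label filter is a simple stand-in for the paper's candidate generation, and the brute-force isomorphism and canonical-form routines are only workable on tiny graphs. It shows a query that is a frequent subgraph being answered by index probing alone, and a non-FG-query falling back to filtering plus candidate verification.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: {vertex: label}; edges: undirected (u, v) pairs, no self-loops."""
    return {"labels": dict(labels), "edges": {frozenset(e) for e in edges}}

def is_subgraph(q, g):
    """Brute-force (non-induced) subgraph isomorphism test; exponential, toy only."""
    qv = list(q["labels"])
    for image in permutations(g["labels"], len(qv)):
        m = dict(zip(qv, image))
        labels_ok = all(q["labels"][v] == g["labels"][m[v]] for v in qv)
        edges_ok = all(frozenset(m[v] for v in e) in g["edges"] for e in q["edges"])
        if labels_ok and edges_ok:
            return True
    return False

def canonical_key(g):
    """Brute-force canonical form: smallest encoding over all vertex orderings."""
    best = None
    for order in permutations(g["labels"]):
        pos = {v: i for i, v in enumerate(order)}
        code = (tuple(g["labels"][v] for v in order),
                tuple(sorted(tuple(sorted(pos[v] for v in e)) for e in g["edges"])))
        if best is None or code < best:
            best = code
    return best

def edge_features(g):
    """Assumed feature set: the unordered label pair of every edge."""
    feats = set()
    for e in g["edges"]:
        u, v = tuple(e)
        feats.add(tuple(sorted((g["labels"][u], g["labels"][v]))))
    return feats

def build_fg_index(db, patterns, min_support):
    """Keep patterns contained in at least min_support graphs, with their answer sets."""
    index = {}
    for p in patterns:
        answers = frozenset(gid for gid, g in db.items() if is_subgraph(p, g))
        if len(answers) >= min_support:
            index[canonical_key(p)] = answers
    return index

def answer_query(q, db, fg_index):
    """Return (answer set, number of graphs that had to be verified)."""
    key = canonical_key(q)
    if key in fg_index:
        # Index probing hit: the stored answer set is exact, no verification.
        return set(fg_index[key]), 0
    # Non-FG-query: filter by features, then verify each remaining candidate.
    qf = edge_features(q)
    candidates = [gid for gid, g in db.items() if qf <= edge_features(g)]
    return {gid for gid in candidates if is_subgraph(q, db[gid])}, len(candidates)

# Tiny example database of three labeled path graphs.
db = {
    "g1": make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "g2": make_graph({1: "C", 2: "O", 3: "N"}, [(1, 2), (2, 3)]),
    "g3": make_graph({1: "C", 2: "C", 3: "N"}, [(1, 2), (2, 3)]),
}
patterns = [
    make_graph({1: "C", 2: "O"}, [(1, 2)]),  # in g1 and g2: frequent
    make_graph({1: "C", 2: "N"}, [(1, 2)]),  # in g3 only: not frequent
]
fg_index = build_fg_index(db, patterns, min_support=2)

fg_query = make_graph({"a": "O", "b": "C"}, [("a", "b")])
non_fg_query = make_graph({1: "C", 2: "C", 3: "N"}, [(1, 2), (2, 3)])
fg_result = answer_query(fg_query, db, fg_index)          # ({"g1", "g2"}, 0)
non_fg_result = answer_query(non_fg_query, db, fg_index)  # ({"g3"}, 1)
```

In this sketch the FG-query costs one dictionary lookup, while the non-FG-query pays for one subgraph isomorphism test after the feature filter has discarded two of the three graphs. Storing more patterns moves more queries onto the verification-free path at the price of a larger index, which is the trade-off the abstract describes.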
