DSI: A Method for Indexing Large Graphs Using Distance Set

Recent years have witnessed a sharp increase in the modeling of data as large graphs in many domains, such as XML, the semantic web, and social networks. In these settings, researchers are interested in queries of the following form: given a large graph G and a query Q, report all matches of Q in G. Since subgraph isomorphism checking is NP-complete [1], it is infeasible to scan the whole large graph for answers, especially when the query is also large. Hence, the "filter-verification" approach is widely adopted: first index the neighborhood of each vertex in the large graph, then use the index to filter out vertices that cannot take part in a match, and finally run a subgraph matching algorithm on the remaining candidates. Previous techniques mainly focus on efficient matching algorithms and pay little attention to indexing. However, an appropriate index can improve query response time by generating fewer candidates. In this paper we investigate indexing techniques for large graphs and propose an index structure, DSI (Distance Set Index), that captures the neighborhood of each vertex. With the distance set index, more vertices can be pruned, which yields a much smaller search space; a subgraph matching algorithm is then performed within that search space. We have applied our index structure to both real and synthetic datasets. Extensive experiments demonstrate the efficiency and effectiveness of our indexing technique.
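The filter-verification pipeline described above can be illustrated with a minimal sketch. The abstract does not define the DSI structure itself, so the signature used here is an assumption chosen only for illustration: for each vertex it records how many vertices of each label lie within distance 1, 2, ..., max_dist. A data vertex is kept as a candidate for a query vertex only if it has the same label and its counts are at least as large at every distance, which is a safe condition because a match can shorten distances but never lengthen them. All function names, the graph encoding (adjacency dictionaries plus a label dictionary), and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter, deque


def label_counts_within(adj, labels, source, max_dist):
    """Return a list whose entry d-1 counts the labels of all vertices
    within distance <= d of source (the source itself is excluded)."""
    dist = {source: 0}
    queue = deque([source])
    per_level = [Counter() for _ in range(max_dist)]
    while queue:
        v = queue.popleft()
        if dist[v] == max_dist:
            continue  # do not explore beyond the indexing radius
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                per_level[dist[w] - 1][labels[w]] += 1
                queue.append(w)
    cumulative, running = [], Counter()
    for level in per_level:
        running = running + level
        cumulative.append(running)
    return cumulative


def build_index(adj, labels, max_dist):
    """Index step: compute the signature of every vertex."""
    return {v: label_counts_within(adj, labels, v, max_dist) for v in adj}


def dominates(data_sig, query_sig):
    """True if the data vertex has at least as many vertices of each
    label within every distance as the query vertex requires."""
    return all(d[label] >= n
               for d, q in zip(data_sig, query_sig)
               for label, n in q.items())


def filter_candidates(g_labels, g_index, q_labels, q_index):
    """Filter step: keep only data vertices that could match each query vertex."""
    return {u: [v for v in g_index
                if g_labels[v] == q_labels[u] and dominates(g_index[v], q_index[u])]
            for u in q_index}


def find_matches(g_adj, q_adj, candidates):
    """Verification step: plain backtracking over the filtered candidates."""
    order = sorted(q_adj, key=lambda u: len(candidates[u]))
    matches = []

    def extend(i, mapping):
        if i == len(order):
            matches.append(dict(mapping))
            return
        u = order[i]
        for v in candidates[u]:
            if v in mapping.values():
                continue  # keep the mapping injective
            # every already-mapped query neighbour must be a data neighbour
            if all(mapping[n] in g_adj[v] for n in q_adj[u] if n in mapping):
                mapping[u] = v
                extend(i + 1, mapping)
                del mapping[u]

    extend(0, {})
    return matches


# Toy data graph: a triangle A-B-C (vertices 1, 2, 3) and a separate edge A-B (4, 5).
g_adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
g_labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B"}

# Toy query: a triangle with labels A, B, C.
q_adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
q_labels = {"a": "A", "b": "B", "c": "C"}

MAX_DIST = 2
g_index = build_index(g_adj, g_labels, MAX_DIST)
q_index = build_index(q_adj, q_labels, MAX_DIST)
candidates = filter_candidates(g_labels, g_index, q_labels, q_index)
matches = find_matches(g_adj, q_adj, candidates)
print(candidates)
print(matches)
```

In this toy example the filter discards vertices 4 and 5 before any matching is attempted, because neither has a C-labeled vertex in its neighborhood, so the verification step only has to examine a single candidate per query vertex.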

[1] Philip S. Yu, et al. Graph Indexing: Tree + Delta >= Graph, 2007, VLDB.

[2] George Karypis, et al. Finding Frequent Patterns in a Large Sparse Graph*, 2005, Data Mining and Knowledge Discovery.

[3] Philip S. Yu, et al. Graph indexing: a frequent structure-based approach, 2004, SIGMOD '04.

[4] Philip S. Yu, et al. GString: A Novel Approach for Efficient Search in Graph Databases, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[5] Shijie Zhang, et al. GADDI: distance index based subgraph matching in biological networks, 2009, EDBT '09.

[6] Jacob D. Furst, et al. Predictive Data Mining for Lung Nodule Interpretation, 2007.

[7] Matthieu Latapy, et al. Efficient and simple generation of random simple connected graphs with prescribed degree sequence, 2005, J. Complex Networks.

[8] Christos Faloutsos, et al. Fast best-effort pattern matching in large attributed graphs, 2007, KDD '07.

[9] Ambuj K. Singh, et al. Closure-Tree: An Index Structure for Graph Queries, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[10] Lei Zou, et al. A novel spectral coding in a large graph database, 2008, EDBT '08.

[11] Wilfred Ng, et al. Fg-index: towards verification-free query processing on graph databases, 2007, SIGMOD '07.

[12] Lei Zou, et al. Top-k subgraph matching query in a large graph, 2007, PIKM '07.

[13] Philip S. Yu, et al. Towards Graph Containment Search and Indexing, 2007, VLDB.

[14] Dennis Shasha, et al. Algorithmics and applications of tree and graph searching, 2002, PODS.

[15] Jignesh M. Patel, et al. TALE: A Tool for Approximate Large Graph Matching, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[16] Ambuj K. Singh, et al. Graphs-at-a-time: query language and access methods for graph databases, 2008, SIGMOD Conference.

[17] Christian Borgelt, et al. Subgraph Support in a Single Large Graph, 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[18] Lei Chen, et al. Continuous Subgraph Pattern Search over Graph Streams, 2009, 2009 IEEE 25th International Conference on Data Engineering.

[19] Stephen A. Cook, et al. The complexity of theorem-proving procedures, 1971, STOC.

[20] Wei Wang, et al. Graph Database Indexing Using Structured Graph Decomposition, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[21] Philip S. Yu, et al. Substructure similarity search in graph databases, 2005, SIGMOD '05.