Compact reachability labeling for graph-structured data

Testing reachability between nodes in a graph is a well-known problem with many important applications, including knowledge representation, program analysis, and, more recently, inference over biological and ontology databases and XML query processing. Various node labeling schemes have been proposed to encode graph reachability information, but most existing schemes work well only for specific types of graphs. In this paper, we propose a novel approach, HLSS (Hierarchical Labeling of Sub-Structures), which identifies different types of substructures within a graph and encodes each using a technique suited to its characteristics. We implement HLSS with an efficient two-phase algorithm: the first phase identifies and encodes strongly connected components and tree substructures, and the second phase encodes the remaining reachability relationships by compressing dense rectangular submatrices in the transitive closure matrix. For the important subproblem of finding the densest submatrices, we demonstrate the hardness of the problem and propose several practical algorithms. Experiments show that HLSS handles different types of graphs well, while existing approaches perform poorly on graphs containing substructures they are not designed to handle.
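
The two-phase idea can be illustrated with a small sketch. This is not the authors' HLSS implementation: it assumes a tiny in-memory graph, uses Kosaraju's algorithm for the strongly connected components, labels a spanning forest of the condensed graph with intervals, and stores the remaining reachable pairs as plain sets instead of compressing dense submatrices of the transitive closure. The names `build_index` and `reachable` are hypothetical.

```python
from collections import defaultdict


def _components(nodes, adj):
    """Kosaraju's algorithm. Component ids follow a topological order of the
    condensed graph. Recursive, so only suitable for small graphs."""
    seen, order = set(), []

    def visit(u):
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                visit(v)
        order.append(u)

    for u in nodes:
        if u not in seen:
            visit(u)

    radj = defaultdict(list)  # reversed graph
    for u in nodes:
        for v in adj[u]:
            radj[v].append(u)

    comp = {}

    def assign(u, cid):
        comp[u] = cid
        for v in radj[u]:
            if v not in comp:
                assign(v, cid)

    next_id = 0
    for u in reversed(order):
        if u not in comp:
            assign(u, next_id)
            next_id += 1
    return comp, next_id


def build_index(nodes, edges):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    # Phase 1a: collapse each strongly connected component into one node.
    comp, k = _components(nodes, adj)
    dag = defaultdict(set)
    for u, v in edges:
        if comp[u] != comp[v]:
            dag[comp[u]].add(comp[v])

    # Phase 1b: interval-label a spanning forest of the condensed DAG.
    interval, counter = {}, 0

    def dfs(c):
        nonlocal counter
        start = counter
        counter += 1
        interval[c] = None  # mark as visited
        for d in sorted(dag[c]):
            if d not in interval:
                dfs(d)
        interval[c] = (start, counter)  # tree descendants start in [start, counter)

    for c in range(k):
        if c not in interval:
            dfs(c)

    # Phase 2 (simplified): keep pairs the forest misses as explicit sets.
    # The paper compresses this residual; here it is stored uncompressed.
    reach, residual = {}, {}
    for c in range(k - 1, -1, -1):  # reverse topological order
        reach[c] = set()
        for d in dag[c]:
            reach[c] |= {d} | reach[d]
        lo, hi = interval[c]
        residual[c] = {d for d in reach[c] if not lo <= interval[d][0] < hi}
    return comp, interval, residual


def reachable(index, u, v):
    comp, interval, residual = index
    cu, cv = comp[u], comp[v]
    if cu == cv:  # same strongly connected component
        return True
    lo, hi = interval[cu]
    return lo <= interval[cv][0] < hi or cv in residual[cu]


nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"),
         ("c", "e"), ("d", "e"), ("f", "e")]
index = build_index(nodes, edges)
```

With this index, `reachable(index, "a", "e")` is answered from the residual set, `reachable(index, "b", "a")` from the shared component, and `reachable(index, "c", "d")` from the interval test alone. The residual sets are where a real scheme would need a more compact encoding, since they can grow quadratically on dense graphs.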
