Scalable reachability indexing for very large graphs

Answering reachability queries in graphs is an important problem. With the development of high-throughput data acquisition techniques and advances in the semantic web and social networks, we have an abundance of enormous graph-structured data on which different queries are asked. One of the fundamental queries, a reachability query, asks whether there exists a path between two given nodes. This can map to the question of whether one researcher has been influenced by another in a citation network; whether a protein indirectly inhibits or activates another in a protein interaction network; whether a protein is broken down to a specific molecule in a metabolic pathway graph; or whether a concept is subsumed by part of another in an ontology. Aside from these direct correspondences with real-life questions, reachability queries can serve as building blocks for more complicated queries in various databases. Therefore, there is a crucial need for mechanisms that expedite querying in graph databases. Existing methods for reachability trade off indexing time and space against query time. However, the biggest limitation of existing methods is that they do not scale to very large real-world graphs. Their performance also degrades as edge density increases. Another limitation of the existing methods is that they barely, if at all, support dynamic updates. This is primarily due to the complex nature of the problem: a single edge addition or deletion can potentially affect the reachability of all pairs of nodes in the graph. Most of the previous work has focused on dynamically maintaining the transitive closure of a graph, which has the obvious O(n^2) worst-case bound, where n is the number of nodes. Moreover, most of the static indexes cannot be directly generalized to the dynamic case. This is because these indexes accept a computationally intensive preprocessing and index construction stage in order to minimize index size and query time.
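The quadratic bound on the transitive closure can be made concrete with a small sketch. The Python code below is illustrative only and is not taken from the thesis; the function names `transitive_closure` and `reachable` are hypothetical. It materializes the full closure as one bitset per node, so queries take a single bit test, but storage grows as n^2 bits and construction runs one traversal per node.

```python
from collections import deque


def transitive_closure(n, edges):
    """Return reach, where reach[u] is an int used as a bitset:
    bit v is set iff v is reachable from u (each node reaches itself)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    reach = []
    for source in range(n):
        seen = 1 << source
        queue = deque([source])
        # Breadth-first traversal from `source`, O(n + m) per node.
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if not (seen >> v) & 1:
                    seen |= 1 << v
                    queue.append(v)
        reach.append(seen)
    return reach


def reachable(reach, u, v):
    """Answer a reachability query with one bit test."""
    return bool((reach[u] >> v) & 1)


# Tiny example graph: 0 -> 1 -> 2 and 3 -> 2.
closure = transitive_closure(4, [(0, 1), (1, 2), (3, 2)])
print(reachable(closure, 0, 2), reachable(closure, 2, 0))
```

Every edge insertion or deletion can change many of these bitsets at once, which is why maintaining the closure dynamically is expensive.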
For dynamic graphs, the efficiency of update operations is another aspect that needs to be optimized. However, costly index construction typically precludes fast updates. It is interesting to note that a simple approach consisting of depth-first search (DFS) can handle graph updates in O(1) time and queries in O(n + m) time, where m is the number of edges. For sparse graphs m = O(n), so the query time is O(n) for most large real-world graphs. Any dynamic index will be effective only if it can amortize the update costs over many reachability queries. In this thesis, we present two approaches for addressing the problems of scalable reachability indexing for both static and dynamic graphs. More specifically, we introduce two indexing schemes, namely GRAIL and DAGGER. GRAIL is a simple yet scalable reachability index that is based on the idea of randomized interval labeling and that can effectively handle very large graphs. Based on an extensive set of experiments, we show that while more sophisticated methods work better on small graphs, GRAIL is the only index that can scale to millions of nodes and edges. GRAIL has linear indexing time and space, and its query time ranges from constant to linear in the graph order and size. Our second contribution is a scalable, lightweight reachability index for dynamic graphs called DAGGER, which has linear (in the order of the graph) index size and index construction time, and reasonably fast update and query times. DAGGER is based on the idea of maintaining randomized interval labels for the nodes of the underlying directed acyclic graph (DAG) of the input graph. DAGGER therefore also yields an efficient algorithm for maintaining the strongly connected components of the evolving graph, which is of independent interest. We demonstrate the efficiency and effectiveness of DAGGER on large dynamic real-world networks such as the Wikipedia graph and citation networks, as well as on synthetic dynamic graphs.
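The idea of randomized interval labeling can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes the input is already a DAG given as adjacency lists (a general graph would first need its strongly connected components condensed), it uses a min-post-order labeling as one plausible realization of interval labels, and the names `random_interval_labeling`, `build_index`, `contains`, and `reachable` are hypothetical. Each of the d random traversals assigns every node an interval; if the interval of v is not contained in that of u in some traversal, u cannot reach v. Otherwise the query falls back to a DFS that prunes with the same test.

```python
import random


def random_interval_labeling(adj, roots, rng):
    """One randomized post-order traversal of a DAG.
    Returns labels[u] = (start, end) such that u reaching v implies
    start[u] <= start[v] and end[v] <= end[u]."""
    n = len(adj)
    start, end, visited = [0] * n, [0] * n, [False] * n
    rank = 0
    order = roots[:]
    rng.shuffle(order)
    for root in order:
        if visited[root]:
            continue
        visited[root] = True
        stack = [(root, iter(rng.sample(adj[root], len(adj[root]))))]
        while stack:
            u, children = stack[-1]
            for v in children:
                if not visited[v]:
                    visited[v] = True
                    stack.append((v, iter(rng.sample(adj[v], len(adj[v])))))
                    break
            else:
                # All children are finished: assign the post-order rank.
                stack.pop()
                rank += 1
                end[u] = rank
                start[u] = min([rank] + [start[c] for c in adj[u]])
    return list(zip(start, end))


def build_index(adj, d=2, seed=0):
    """Build d independent random labelings (d is an assumed parameter)."""
    rng = random.Random(seed)
    has_parent = [False] * len(adj)
    for u in range(len(adj)):
        for v in adj[u]:
            has_parent[v] = True
    roots = [u for u in range(len(adj)) if not has_parent[u]]
    return [random_interval_labeling(adj, roots, rng) for _ in range(d)]


def contains(labelings, u, v):
    """True if the interval of v lies inside that of u in every labeling."""
    return all(lab[u][0] <= lab[v][0] and lab[v][1] <= lab[u][1]
               for lab in labelings)


def reachable(adj, labelings, u, v):
    if u == v:
        return True
    if not contains(labelings, u, v):
        return False  # definite negative, answered from labels alone
    # Possible false positive: verify with a label-pruned DFS.
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        for c in adj[x]:
            if c == v:
                return True
            if c not in seen and contains(labelings, c, v):
                seen.add(c)
                stack.append(c)
    return False


# Example DAG: 0 -> {1, 2}, 1 -> 3, 2 -> 3, 3 -> 4, 5 -> 2.
adj = [[1, 2], [3], [3], [4], [], [2]]
index = build_index(adj, d=3, seed=1)
print(reachable(adj, index, 0, 4), reachable(adj, index, 1, 2))
```

In this sketch the index stores d intervals per node, so its size is linear in the number of nodes, and the pruned DFS keeps answers exact even when the labels alone are inconclusive.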
In the future, we plan to improve the query time of DAGGER by maximizing the quality of the index while keeping updates fast. We also plan to extend GRAIL and DAGGER to other variants of the reachability problem, such as constrained reachability and shortest-path queries.
