Performance Guarantees for Distributed Reachability Queries

In the real world, a graph is often fragmented and distributed across different sites. This highlights the need for evaluating queries on distributed graphs. This paper proposes distributed evaluation algorithms for three classes of queries: reachability, for determining whether one node can reach another; bounded reachability, for deciding whether there exists a path of bounded length between a pair of nodes; and regular reachability, for checking whether there exists a path connecting two nodes such that the node labels on the path form a string that matches a given regular expression. We develop these algorithms based on partial evaluation, to exploit parallel computation. When evaluating a query Q on a distributed graph G, we show that these algorithms possess the following performance guarantees, no matter how G is fragmented and distributed: (1) each site is visited only once; (2) the total network traffic is determined by the size of Q and the fragmentation of G, independent of the size of G; and (3) the response time is determined by the largest fragment of G rather than the entire G. In addition, we show that these algorithms can be readily implemented in the MapReduce framework. Using synthetic and real-life data, we experimentally verify that these algorithms are scalable on large graphs, regardless of how the graphs are distributed.
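
To make the idea of partial evaluation concrete, the following is a minimal Python sketch for plain reachability only. It is an illustration under stated assumptions, not the paper's algorithm: the names `local_eval`, `reachable`, and `FRAGMENTS` are hypothetical, each fragment is assumed to know its own nodes, its outgoing edges (including edges to nodes stored elsewhere), and its "in-nodes" (local nodes that are targets of edges from other fragments), and the sites are simulated by a loop rather than run in parallel.

```python
from collections import deque

def local_eval(nodes, edges, in_nodes, s, t):
    """Partial answer of one site (hypothetical helper).

    For the source s (if stored here) and every in-node, report whether t
    is reached using only this fragment's edges, and which nodes stored at
    other sites are reached. The result size depends on the number of
    boundary nodes, not on the fragment size.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    sources = set(in_nodes)
    if s in nodes:
        sources.add(s)
    result = {}
    for src in sources:
        seen, queue = {src}, deque([src])
        reaches_t, exits = (src == t), set()
        while queue:
            u = queue.popleft()
            for v in adj.get(u, []):
                if v in seen:
                    continue
                seen.add(v)
                if v == t:
                    reaches_t = True
                if v in nodes:
                    queue.append(v)  # keep exploring inside this fragment
                else:
                    exits.add(v)     # node stored at another site
        result[src] = (reaches_t, exits)
    return result

def reachable(fragments, s, t):
    """Coordinator: contact each site once, then combine partial answers."""
    partial = {}
    for nodes, edges, in_nodes in fragments:  # one visit per site
        partial.update(local_eval(nodes, edges, in_nodes, s, t))
    # Search over boundary nodes only, using the collected partial answers.
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        reaches_t, exits = partial.get(u, (False, set()))
        if reaches_t:
            return True
        for v in exits - seen:
            seen.add(v)
            queue.append(v)
    return False

# Toy graph split over three sites: a -> b -> c -> d -> a, with e isolated.
FRAGMENTS = [
    ({"a", "b"}, [("a", "b"), ("b", "c")], {"a"}),
    ({"c", "d"}, [("c", "d"), ("d", "a")], {"c"}),
    ({"e"}, [], set()),
]

print(reachable(FRAGMENTS, "a", "d"))
print(reachable(FRAGMENTS, "a", "e"))
```

In this sketch each site is contacted once and returns only a summary over its boundary nodes, which mirrors guarantees (1) and (2) in spirit; the bounded and regular variants would need distances or automaton states in the partial answers and are not covered here.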
