Efficient query evaluation on distributed graphs with Hadoop environment

Graphs have emerged as a powerful data structure for describing many kinds of data. Query evaluation on distributed graphs is costly because of the complexity of the links among sites. Dan Suciu proposed algorithms for query evaluation on semistructured data modeled as a rooted, edge-labeled graph, and these algorithms are proved to be efficient in terms of the number of communication steps and the amount of data transferred during evaluation. One disadvantage, however, is that the communication data are collected at a single site, which creates a bottleneck when evaluating real-life data. In this paper, we propose two algorithms that improve on Suciu's algorithms: the one-pass algorithm removes a large amount of redundant data from the evaluation, and the iter_acc algorithm resolves the bottleneck. We then design an efficient implementation of our algorithms that needs only one MapReduce job in a Hadoop environment, by exploiting features of the Hadoop file system. Experiments on a cloud system show that the one-pass algorithm detects and removes 50% of the data in the evaluation process as redundant on the YouTube and DBLP datasets, and that the iter_acc algorithm runs without the bottleneck even when the size of the input data is doubled.
