Paper information - Efficient query evaluation on distributed graphs with Hadoop environment

Efficient query evaluation on distributed graphs with Hadoop environment

Graphs have emerged as a powerful data structure for describing many kinds of data. Query evaluation on distributed graphs is costly because of the complexity of the links among sites. Dan Suciu proposed algorithms for query evaluation on semistructured data modeled as a rooted, edge-labeled graph, and these algorithms are proved to be efficient in terms of the number of communication steps and the amount of data transferred during evaluation. However, one disadvantage is that communication data are collected at a single site, which creates a bottleneck when evaluating real-life data. In this paper, we propose two algorithms that improve on Dan Suciu's algorithms: the one-pass algorithm significantly reduces the amount of redundant data in the evaluation, and the iter_acc algorithm resolves the bottleneck. We then design an efficient implementation of our algorithms that needs only one MapReduce job in the Hadoop environment by exploiting features of the Hadoop file system. Experiments on a cloud system show that the one-pass algorithm can detect and remove 50% of the data in the evaluation process as redundant on the YouTube and DBLP datasets, and that the iter_acc algorithm runs without the bottleneck even when we double the size of the input data.

Zhenjiang Hu | Le-Duc Tung | Quyet Nguyen-Van

[1] Hosung Park, et al. What is Twitter, a social network or a news media?, 2010, WWW '10.

[2] Aart J. C. Bik, et al. Pregel: a system for large-scale graph processing, 2010, SIGMOD Conference.

[3] Leslie G. Valiant, et al. A bridging model for parallel computation, 1990, CACM.

[4] Alex Thomo, et al. Fault-tolerant computation of distributed regular path queries, 2009, Theor. Comput. Sci.

[5] Dan Suciu, et al. A query language and optimization techniques for unstructured data, 1996, SIGMOD '96.

[6] Dan Suciu, et al. Distributed query evaluation on semistructured data, 2002, TODS.

[7] Abraham Bernstein, et al. Signal/Collect: Graph Algorithms for the (Semantic) Web, 2010, SEMWEB.

[8] Andrei Z. Broder, et al. Graph structure in the Web, 2000, Comput. Networks.

[9] Xin Wang, et al. Performance Guarantees for Distributed Reachability Queries, 2012, Proc. VLDB Endow.

[10] Dan Suciu, et al. Data on the Web: From Relations to Semistructured Data and XML, 1999.

[11] Jin-Soo Kim, et al. HAMA: An Efficient Matrix Computation with the MapReduce Framework, 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.