Paper Information - Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing

Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing

Billion-node graphs are rapidly growing in size in many applications such as online social networks. Most graph algorithms generate a large number of messages during iterative computations. Vertex-centric distributed systems usually store graph data and message data on disk to improve scalability. Currently, these distributed systems with disk-resident data take a push-based approach to handle messages. This works well if few messages reside on disk. Otherwise, it is I/O-inefficient due to expensive random writes. By contrast, the existing memory-resident pull-based approach pulls messages individually for each vertex on demand. Although it can avoid disk operations on messages, it incurs high I/O costs from frequent random accesses to vertices. This paper proposes a hybrid solution that switches between push and pull adaptively, to obtain optimal performance for distributed systems with disk-resident data in different scenarios. We first employ a new block-centric technique (b-pull) to improve the I/O performance of pulling messages, although the iterative computation remains vertex-centric. I/O costs of data accesses are shifted from the receiver side, where messages are written and read by push, to the sender side, where graph data are read by b-pull. Graph data are organized by clustering vertices and edges to achieve high I/O efficiency in b-pull. Second, we design a seamless switching mechanism and a performance prediction method to guarantee efficiency when switching between push and b-pull. We conduct extensive performance studies to confirm the effectiveness of our proposals over existing state-of-the-art solutions using a broad spectrum of real-world graphs.
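The contrast between push and block-centric pull can be illustrated with a small in-memory sketch. This is not the paper's implementation: the function names (`push_step`, `block_pull_step`, `choose_mode`), the sum combiner, the toy graph, and the buffer-capacity rule are all assumptions made for illustration, and the sketch does no disk I/O. It only shows that both styles deliver the same combined messages, while push materializes a per-receiver inbox and the block-based pull scans sender-side edges once per receiver block.

```python
from collections import defaultdict


def push_step(out_edges, values):
    """Push: each sender writes one message into every receiver's inbox."""
    inbox = defaultdict(list)  # receiver-side message store (hypothetical)
    for u, targets in out_edges.items():
        for v in targets:
            inbox[v].append(values[u])
    # Receivers combine their buffered messages (sum is an assumed combiner).
    return {v: sum(msgs) for v, msgs in inbox.items()}


def block_pull_step(out_edges, values, blocks):
    """Block-based pull: for each receiver block, scan sender-side edges once
    and combine messages on the fly, so no per-message inbox is stored."""
    result = {}
    for block in blocks:
        wanted = set(block)
        acc = defaultdict(float)
        for u, targets in out_edges.items():
            for v in targets:
                if v in wanted:
                    acc[v] += values[u]
        result.update(acc)
    return result


def choose_mode(num_messages, buffer_capacity):
    """Placeholder switching rule (an assumption, not the paper's predictor):
    push while messages fit in memory, otherwise fall back to b-pull."""
    return "push" if num_messages <= buffer_capacity else "b-pull"


# Toy graph: vertex -> list of out-neighbors.
out_edges = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
blocks = [[0, 1], [2, 3]]  # hypothetical clustering of vertices into blocks

pushed = push_step(out_edges, values)
pulled = block_pull_step(out_edges, values, blocks)
print(pushed == pulled)  # both styles produce the same combined messages
```

In a disk-resident setting, the inbox in `push_step` would correspond to the receiver-side writes the abstract describes, while the repeated edge scan in `block_pull_step` would correspond to sender-side reads of graph data.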
