Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing

Billion-node graphs are growing rapidly in many applications, such as online social networks. Most graph algorithms generate a large number of messages during iterative computation. Vertex-centric distributed systems usually store graph data and message data on disk to improve scalability. Currently, these distributed systems with disk-resident data take a push-based approach to handling messages. This works well if few messages reside on disk; otherwise, it is I/O-inefficient because of expensive random writes. By contrast, the existing memory-resident pull-based approach pulls messages individually for each vertex on demand. Although it avoids disk operations on messages, it incurs expensive I/O costs through random and frequent access to vertices. This paper proposes a hybrid solution that switches adaptively between push and pull to obtain optimal performance for distributed systems with disk-resident data in different scenarios. First, we employ a new block-centric technique (b-pull) to improve the I/O performance of pulling messages, while the iterative computation remains vertex-centric. The I/O costs of data accesses are shifted from the receiver side, where push writes and reads messages, to the sender side, where b-pull reads graph data. Graph data are organized by clustering vertices and edges to achieve high I/O efficiency in b-pull. Second, we design a seamless switching mechanism and a performance prediction method to guarantee efficiency when switching between push and b-pull. We conduct extensive performance studies on a broad spectrum of real-world graphs to confirm the effectiveness of our proposals over existing state-of-the-art solutions.
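The contrast between push and block-centric pull can be illustrated with a small in-memory sketch. It is not the paper's implementation: the toy graph, the id-range block assignment (`BLOCK_SIZE`, `block_of`), the PageRank-like update, and the threshold rule in `choose_mode` are all assumptions made for illustration. In particular, the abstract does not describe the actual performance prediction method, so `choose_mode` is only a hypothetical stand-in.

```python
from collections import defaultdict

# Toy directed graph (assumption): vertex -> list of out-neighbours.
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
BLOCK_SIZE = 2  # hypothetical: vertices per block, assigned by id range


def block_of(v):
    """Map a vertex id to its block id (hypothetical id-range clustering)."""
    return v // BLOCK_SIZE


def push_step(values):
    """Push: each sender writes one message per out-edge into the receiver's inbox."""
    inbox = defaultdict(list)
    for src, targets in GRAPH.items():
        share = values[src] / len(targets)
        for dst in targets:
            # With disk-resident messages, this append is a scattered write.
            inbox[dst].append(share)
    return {v: sum(inbox[v]) for v in GRAPH}


def build_block_edges():
    """Cluster edges by (source block, destination block) for sequential scans."""
    groups = defaultdict(list)
    for src, targets in GRAPH.items():
        for dst in targets:
            groups[(block_of(src), block_of(dst))].append((src, dst))
    return groups


def b_pull_step(values, groups):
    """Block-centric pull: each destination block requests messages block by block.

    Messages are produced on the sender side by scanning one edge group and are
    consumed immediately, so no message inbox has to be stored.
    """
    blocks = sorted({block_of(v) for v in GRAPH})
    new_values = {v: 0.0 for v in GRAPH}
    for dst_block in blocks:
        for src_block in blocks:
            for src, dst in groups.get((src_block, dst_block), []):
                new_values[dst] += values[src] / len(GRAPH[src])
    return new_values


def choose_mode(num_messages, buffer_capacity):
    """Hypothetical switching rule: push while messages fit in memory, else b-pull."""
    return "push" if num_messages <= buffer_capacity else "b-pull"


values = {v: 1.0 for v in GRAPH}
pushed = push_step(values)
pulled = b_pull_step(values, build_block_edges())
```

Both steps compute the same per-vertex result; they differ only in where data is accessed. Push materializes messages at the receiver, while the b-pull sketch reads clustered edge groups at the sender, which is the shift in I/O cost that the abstract describes.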
