An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations

In recent year, many large-scale iterative graph computation systems such as Pregel have been developed. To ensure that these systems are fault-tolerant, checkpointing, which archives graph states onto distributed file systems periodically, has been proposed. However, fault-tolerance remains to be challenging because the whole data set is archived with a static interval, rendering underlying graph computations to entail I/O-costs in terms of disk and network communication. Motivated by this, we first propose to dynamically adjust checkpoint intervals based on a carefully designed cost-analysis model, by taking the underlying computing workload into account. Furthermore, for algorithms that can be restarted from any point during computations, we prioritize graph states and then checkpointing can be performed with selected data, instead of the entire dataset, to reduce archiving overhead while simultaneously guaranteeing the failure recovery efficiency. Finally, we conduct extensive performance studies to confirm the effectiveness of our approaches over existing up-to-date solutions using a broad spectrum of real-world graphs.

[1]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[2]  Jennifer Widom,et al.  GPS: a graph processing system , 2013, SSDBM.

[3]  Haibo Chen,et al.  SYNC or ASYNC: time to fuse for distributed graph-parallel computation , 2015, PPoPP.

[4]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[5]  Ge Yu,et al.  Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing , 2016, SIGMOD Conference.

[6]  Jin-Soo Kim,et al.  HAMA: An Efficient Matrix Computation with the MapReduce Framework , 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[7]  Yanfeng Zhang,et al.  PrIter: A Distributed Framework for Prioritizing Iterative Computations , 2011, IEEE Transactions on Parallel and Distributed Systems.

[8]  Gang Chen,et al.  Fast Failure Recovery in Distributed Graph Processing Systems , 2014, Proc. VLDB Endow..

[9]  Leo Katz,et al.  A new status index derived from sociometric analysis , 1953 .

[10]  Binyu Zang,et al.  PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs , 2019, TOPC.

[11]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[12]  Yafei Dai,et al.  Seraph: an efficient, low-cost system for concurrent graph processing , 2014, HPDC '14.

[13]  Chang Zhou,et al.  MOCgraph: Scalable Distributed Graph Processing Using Message Online Computing , 2014, Proc. VLDB Endow..

[14]  Zizhong Chen Algorithm-based recovery for iterative methods without checkpointing , 2011, HPDC '11.

[15]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[16]  Volker Markl,et al.  "All roads lead to Rome": optimistic recovery for distributed iterative data processing , 2013, CIKM.

[17]  Panos Kalnis,et al.  Mizan: a system for dynamic load balancing in large-scale graph processing , 2013, EuroSys '13.

[18]  Michael J. Carey,et al.  Pregelix: Big(ger) Graph Analytics on a Dataflow Engine , 2014, Proc. VLDB Endow..

[19]  Shirish Tatikonda,et al.  From "Think Like a Vertex" to "Think Like a Graph" , 2013, Proc. VLDB Endow..

[20]  Jeffrey Xu Yu,et al.  Catch the Wind: Graph workload balancing on cloud , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[21]  Volker Markl,et al.  Spinning Fast Iterative Data Flows , 2012, Proc. VLDB Endow..

[22]  Lixin Gao,et al.  Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation , 2017, 1710.05785.

[23]  Reynold Xin,et al.  GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[24]  Wilfred Ng,et al.  Blogel: A Block-Centric Framework for Distributed Computation on Real-World Graphs , 2014, Proc. VLDB Endow..

[25]  Ge Yu,et al.  A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments , 2018, IEEE Trans. Parallel Distributed Syst..

[26]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[27]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[28]  Luke M. Leslie,et al.  Zorro: zero-cost reactive failure recovery in distributed graph processing , 2015, SoCC.

[29]  Fabian Hueske,et al.  Apache Flink , 2019, Encyclopedia of Big Data Technologies.

[30]  Chen Xu,et al.  Efficient fault-tolerance for iterative graph processing on distributed dataflow systems , 2016, 2016 IEEE 32nd International Conference on Data Engineering (ICDE).

[31]  M. Tamer Özsu,et al.  An Experimental Comparison of Pregel-like Graph Processing Systems , 2014, Proc. VLDB Endow..