A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments

Most graph algorithms are iterative, and distributed in-memory systems can process them efficiently in an asynchronous manner. However, recovering from failures in such systems is challenging, because traditional checkpoint-based fault-tolerance frameworks incur expensive barrier costs that usually offset the gains of asynchronous computation. Worse, surviving data are rolled back, leading to costly recomputation. This paper is the first to propose leveraging surviving data for failure recovery in an asynchronous system. Our framework guarantees the correctness of algorithms and avoids rolling back surviving data. Additionally, a novel asynchronous checkpointing solution is introduced to accelerate recovery at nearly zero overhead. Optimization strategies such as message pruning, non-blocking recovery, and load balancing are also designed to further boost performance. Extensive experiments on real-world graphs demonstrate the effectiveness of our proposals.
