On Fault Tolerance for Distributed Iterative Dataflow Processing

Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typically, these analytics are part of a comprehensive workflow that includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically, which enables these systems to reason about and automatically optimize the entire pipeline. However, graph and machine learning analytics are known to incur long runtimes, since they require multiple passes over the data until convergence is reached. Thus, fault tolerance and fast recovery from intermittent failures are critical for efficient analysis. In this paper, we propose novel fault-tolerance mechanisms for graph and machine learning analytics that run on distributed dataflow systems. We seek to reduce checkpointing costs and shorten failure-recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to the conventional approach to unblocking checkpointing, which manages checkpoints of immutable datasets independently of the dataflow, we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design, which simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, we recover rapidly via confined recovery: by exploiting log files that already exist locally on healthy nodes, we avoid a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby a broadcast variable enables us to recover quickly without introducing any checkpoints.
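To make the benefit of confined recovery concrete, the following is a minimal sketch, not the paper's implementation: it models an iterative computation over partitions with periodic checkpoints and compares the repeated work after a failure under complete recovery (all partitions restart from the last checkpoint) versus confined recovery (only the lost partition recomputes, while healthy partitions replay locally logged outputs). The function and parameter names are hypothetical.

```python
# Hypothetical cost model contrasting complete vs. confined recovery
# for an iterative, partitioned computation with periodic checkpoints.

def iterations_to_redo(num_partitions, checkpoint_interval,
                       failure_iteration, confined):
    """Total partition-iterations of work repeated after a failure."""
    # The system can only roll back to the most recent checkpoint.
    last_checkpoint = (failure_iteration // checkpoint_interval) * checkpoint_interval
    lost_iters = failure_iteration - last_checkpoint
    if confined:
        # Only the failed partition recomputes its lost iterations;
        # healthy partitions serve logged messages instead of recomputing.
        return lost_iters
    # Complete recovery: every partition redoes the lost iterations.
    return lost_iters * num_partitions

# Failure during iteration 17, checkpoints every 5 iterations, 8 partitions:
assert iterations_to_redo(8, 5, 17, confined=True) == 2
assert iterations_to_redo(8, 5, 17, confined=False) == 16
```

The sketch ignores log-replay overhead, but it illustrates why recovery work shrinks roughly by a factor of the partition count when recomputation is confined to the failed node.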
To evaluate our fault-tolerance strategies, we conduct both a theoretical study and experimental analyses using Apache Flink, and find that they outperform blocking checkpointing and complete recovery.
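Theoretical studies of checkpointing cost commonly build on classic checkpoint-interval models. As a hedged illustration (a textbook model, not the analysis performed in this paper), Young's first-order approximation relates the optimal checkpoint interval to the cost of writing a checkpoint and the mean time between failures:

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint
    interval: sqrt(2 * C * MTBF), with all times in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# E.g., 30 s to write a checkpoint and a mean time between failures
# of 6 hours yields an interval of roughly 19 minutes.
interval = young_interval(30.0, 6 * 3600)
assert 1100 < interval < 1200
```

The intuition carries over to iterative dataflows: cheaper (unblocking) checkpoints shrink the optimal interval, so the system can checkpoint more often and lose less work per failure.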
