An Optimal Checkpointing Model with Online OCI Adjustment for Stream Processing Applications

Checkpoint-based fault tolerance is widely used to enhance the reliability of Distributed Stream Processing Engines (DSPEs), but checkpointing usually introduces considerable overhead. Choosing the Optimal Checkpoint Interval (OCI) that maximizes processing efficiency is therefore a critical issue. Traditional OCI models assume that the recovery time depends only on the execution time between the last checkpoint and the moment of failure. They are not suitable for stream processing jobs, where the recovery time depends on the reprocessing workload, which in turn depends on the real-time input data received before a failure. A new model is needed to choose the OCI for stream processing applications. Moreover, the input data rate of a stream processing job fluctuates over time, so the OCI of an application should also be adjusted dynamically according to the input workload. To solve these problems, we present a novel DSPS Optimal Checkpoint Interval (DOCI) model in this paper. We prove that it maximizes the processing efficiency for a given time period. We also propose an approach that dynamically adjusts the OCI of an application to accommodate real-time workload fluctuations. We conduct simulation experiments to verify the effectiveness of the DOCI model and the efficiency of the online OCI adjustment algorithm. Experimental results with a real-world dataset show that DOCI improves system efficiency by up to 40% compared with existing fault-tolerant approaches.
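
For context, the kind of traditional model the abstract contrasts itself with can be illustrated by Young's classic first-order approximation, which sets the interval to sqrt(2 * C * M) for a checkpoint cost C and a mean time between failures M. The sketch below is not the DOCI model, whose formula is not given in this abstract. The function names are hypothetical, and the replay estimate assumes that a failure is equally likely at any point within an interval, so that on average half an interval of input must be reprocessed. It only shows why an interval chosen without regard to input rate leaves the reprocessing workload free to grow with that rate.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost_s: time to take one checkpoint, in seconds.
    mtbf_s: mean time between failures, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_replay_tuples(input_rate_tps: float, interval_s: float) -> float:
    """Illustrative estimate of tuples to reprocess after a failure.

    Assumption (not from the paper): the failure time is uniform within
    the interval, so half an interval of input is replayed on average.
    """
    if input_rate_tps < 0 or interval_s <= 0:
        raise ValueError("rate must be non-negative and interval positive")
    return input_rate_tps * interval_s / 2.0


# The interval depends only on checkpoint cost and MTBF ...
interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)

# ... while the replay workload scales with the current input rate.
replay_low = expected_replay_tuples(1_000.0, interval)
replay_high = expected_replay_tuples(5_000.0, interval)
```

With a fixed interval, a fivefold increase in input rate yields a fivefold increase in the estimated replay workload, which is the mismatch that motivates a workload-aware interval.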