Efficient incremental checkpoint algorithm for primary-backup replication

Replication protocols are widely used to provide fault tolerance and reliability in distributed systems, with the aim of fast recovery and seamless transition. In this study, we propose an efficient incremental checkpoint algorithm for primary-backup replication protocols to increase system throughput. We developed an in-memory key-value store configured with the primary-backup replication protocol and deployed it on geographically distributed nodes of the PlanetLab overlay network. We measured the metrics of interest on both the client side and the primary replica side. Our findings show that the proposed incremental checkpoint algorithm not only achieves 2–3 times lower average blocking times but also maintains a near-steady minimum average blocking time.
