Analysis of checkpointing overhead in parallel state machine replication

State machine replication (SMR) is a well-established technique to fault-tolerant systems. In part, this is explained by the simplicity of the approach and its strong consistency guarantees. Recently, several proposals have suggested parallelizing the execution of state machine replicas to achieve high throughput. Concurrent execution of commands has many implications, including the recovery of replicas from failures. Conventional checkpointing techniques, for example, must be revisited in parallelized models. In this paper, we review parallel variations of state machine replication and discuss how checkpointing procedures apply to these models. Moreover, we evaluate the impact caused by checkpointing techniques on recovery through simulations.

[1]  Peng Li,et al.  Paxos Replicated State Machines as the Basis of a High-Performance Data Store , 2011, NSDI.

[2]  Fernando Pedone,et al.  Rethinking State-Machine Replication for Parallelism , 2013, 2014 IEEE 34th International Conference on Distributed Computing Systems.

[3]  Miguel Oom Temudo de Castro,et al.  Practical Byzantine fault tolerance , 1999, OSDI '99.

[4]  Leslie Lamport,et al.  The part-time parliament , 1998, TOCS.

[5]  Fernando Pedone,et al.  Checkpointing in Parallel State-Machine Replication , 2014, OPODIS.

[6]  Yang Wang,et al.  All about Eve: Execute-Verify Replication for Multi-Core Servers , 2012, OSDI.

[7]  Alan Fekete,et al.  Consistency Models for Replicated Data , 2010, Replication.

[8]  Brett D. Fleisch,et al.  The Chubby lock service for loosely-coupled distributed systems , 2006, OSDI '06.

[9]  Fred B. Schneider,et al.  Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[10]  Tushar Deepak Chandra,et al.  Paxos Made Live - An Engineering Perspective (2006 Invited Talk) , 2007 .

[11]  Marcos K. Aguilera,et al.  Failure detection and consensus in the crash-recovery model , 2000, Distributed Computing.

[12]  Nancy A. Lynch,et al.  Consensus in the presence of partial synchrony , 1988, JACM.

[13]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[14]  Robert Griesemer,et al.  Paxos made live: an engineering perspective , 2007, PODC '07.

[15]  Ramakrishna Kotla,et al.  High throughput Byzantine fault tolerance , 2004, International Conference on Dependable Systems and Networks, 2004.

[16]  Miguel Correia,et al.  On the Efficiency of Durable State Machine Replication , 2013, USENIX Annual Technical Conference.