Scalable State-Machine Replication

State-machine replication (SMR) is a well-known technique for providing fault tolerance. SMR sequences client requests and executes them against all replicas in the same order; because execution is deterministic, every replica reaches the same state after executing each request. However, SMR does not scale: every replica added to the system executes all requests, so throughput does not increase with the number of replicas. Scalable SMR (S-SMR) addresses this issue in two ways: (i) by partitioning the application state while still allowing every command to access any combination of partitions, and (ii) by using a caching algorithm to reduce communication across partitions. We describe Eyrie, a Java library that implements S-SMR, and Volery, an application that implements ZooKeeper's API. We assess the performance of Volery and compare the results against ZooKeeper. Our experiments show that Volery's throughput scales with the number of partitions.
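
The contrast between classic SMR and a partitioned design can be illustrated with a minimal Java sketch. It is not Eyrie's or Volery's API: the names `SmrSketch`, `Command`, `Replica`, `partitionOf`, and `destinations` are hypothetical, and hashing keys to partitions is an assumption made for the example. The sketch also leaves out the ordering protocol, the exchange of state between partitions that a multi-partition command may need, and the caching algorithm; it only shows which replicas execute which commands.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: none of these names come from Eyrie or Volery.
class SmrSketch {

    // Hypothetical command: a batch of key/value writes.
    static final class Command {
        final Map<String, Integer> writes;
        Command(Map<String, Integer> writes) { this.writes = writes; }
    }

    // A deterministic state machine: its state depends only on the
    // sequence of commands applied to it.
    static final class Replica {
        final Map<String, Integer> state = new TreeMap<>();
        int executed = 0;

        // Applies the writes of cmd whose keys belong to partition `self`.
        void apply(Command cmd, int self, int partitions) {
            for (Map.Entry<String, Integer> w : cmd.writes.entrySet()) {
                if (partitionOf(w.getKey(), partitions) == self) {
                    state.put(w.getKey(), w.getValue());
                }
            }
            executed++;
        }
    }

    // Assumed placement rule: hash of the key modulo the partition count.
    static int partitionOf(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // The set of partitions a command touches; it may contain any combination.
    static Set<Integer> destinations(Command cmd, int partitions) {
        Set<Integer> dests = new TreeSet<>();
        for (String key : cmd.writes.keySet()) {
            dests.add(partitionOf(key, partitions));
        }
        return dests;
    }

    // Classic SMR: every replica executes every command of the ordered log.
    static List<Replica> runClassic(List<Command> log, int replicas) {
        List<Replica> group = new ArrayList<>();
        for (int i = 0; i < replicas; i++) group.add(new Replica());
        for (Command cmd : log) {
            for (Replica r : group) r.apply(cmd, 0, 1); // one partition owns all keys
        }
        return group;
    }

    // Partitioned execution: a command runs only at the partitions it touches.
    static List<Replica> runPartitioned(List<Command> log, int partitions) {
        List<Replica> group = new ArrayList<>();
        for (int i = 0; i < partitions; i++) group.add(new Replica());
        for (Command cmd : log) {
            for (int p : destinations(cmd, partitions)) {
                group.get(p).apply(cmd, p, partitions);
            }
        }
        return group;
    }

    // A small ordered log; the last command spans two keys.
    static List<Command> sampleLog() {
        return List.of(
            new Command(Map.of("a", 1)),
            new Command(Map.of("b", 2)),
            new Command(Map.of("a", 3)),
            new Command(Map.of("a", 4, "b", 5)));
    }

    public static void main(String[] args) {
        List<Command> log = sampleLog();
        for (Replica r : runClassic(log, 2)) {
            System.out.println("classic replica: executed=" + r.executed + " state=" + r.state);
        }
        for (Replica r : runPartitioned(log, 2)) {
            System.out.println("partition: executed=" + r.executed + " state=" + r.state);
        }
    }
}
```

With the sample log, each classic replica executes all four commands and ends in the same state, so adding replicas adds no throughput. With two partitions, one partition executes two commands and the other three; only the last command, which touches both keys, is executed at both.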
