Separating the WHEAT from the Chaff: An Empirical Design for Geo-Replicated State Machines

State machine replication (SMR) is a fundamental technique for implementing consistent, fault-tolerant services. In recent years, several protocols have been proposed to improve the latency of this technique when the replicas are deployed in geographically dispersed locations. In this work we evaluate some representative optimizations proposed in the literature by implementing them in an open-source state machine replication library and running experiments on geographically diverse PlanetLab nodes and in Amazon EC2 regions. Interestingly, our results show that some optimizations widely used to improve the latency of geo-replicated state machines do not bring significant benefits, while others, not yet considered in this context, are very effective. Based on this evaluation, we propose WHEAT, a configurable crash and Byzantine fault-tolerant state machine replication library that uses the optimizations we observed to be most effective in reducing SMR latency. WHEAT employs novel vote assignment schemes that, by using a few additional spare replicas, enable the system to make progress without needing to access a majority of replicas. Our evaluation shows that a WHEAT system deployed in several Amazon EC2 regions presents a median latency up to 56% lower than a "normal" SMR protocol.
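
To illustrate how assigning votes can let a system proceed without contacting a majority of replicas, the following is a minimal sketch of generic weighted voting. It is not WHEAT's actual vote assignment scheme, which the abstract does not specify. The replica names, the weights, and the rule that a quorum must hold more than half of the total weight are all assumptions chosen for illustration.

```python
from itertools import combinations

def is_quorum(weights, members):
    """A set of replicas is a quorum if it holds more than half of all votes.

    Assumed rule for illustration: any two such sets must share a replica,
    because their combined weight exceeds the total weight.
    """
    total = sum(weights.values())
    return 2 * sum(weights[r] for r in members) > total

def smallest_quorum_size(weights):
    """Return the fewest replicas that can form a quorum."""
    for size in range(1, len(weights) + 1):
        if any(is_quorum(weights, group) for group in combinations(weights, size)):
            return size
    return None  # only reached when there are no replicas

# Hypothetical five-replica deployment with one vote per replica.
equal_votes = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1}

# Same replicas, but two of them (say, the best-connected ones) hold two votes.
skewed_votes = {"A": 2, "B": 2, "C": 1, "D": 1, "E": 1}

print(smallest_quorum_size(equal_votes))   # majority of replicas needed
print(smallest_quorum_size(skewed_votes))  # fewer replicas suffice
```

With equal votes, three of the five replicas must respond. With the skewed assignment, replicas A and B alone hold 4 of the 7 votes, so two replicas are enough, while the three single-vote replicas together are not. The latency benefit in a geo-replicated setting would come from giving the extra votes to replicas that tend to respond fastest.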
