Efficient replication via timestamp stability

Modern web applications replicate their data across the globe and require strong consistency guarantees for their most critical data. These guarantees are usually provided via state-machine replication (SMR). Recent advances in SMR have focused on leaderless protocols, which improve the availability and performance of traditional Paxos-based solutions. We propose Tempo, a leaderless SMR protocol that, in comparison to prior solutions, achieves superior throughput and offers predictable performance even in contended workloads. To achieve these benefits, Tempo timestamps each application command and executes it only after the timestamp becomes stable, i.e., all commands with a lower timestamp are known. Both the timestamping and stability detection mechanisms are fully decentralized, thus obviating the need for a leader replica. Our protocol furthermore generalizes to partial replication settings, enabling scalability in highly parallel workloads. We evaluate the protocol in both real and simulated geo-distributed environments and demonstrate that it outperforms state-of-the-art alternatives.

CCS Concepts: • Theory of computation → Distributed algorithms.
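
The idea of executing a command only once its timestamp is stable can be illustrated with a small sketch. This is not Tempo's actual algorithm: the abstract does not spell out the mechanism, so the sketch assumes a simplified rule in which every replica reports a watermark (a promise that it will issue no further commands at or below that timestamp), and a timestamp counts as stable once all replicas have promised past it. The class `StabilityTracker` and its methods `commit`, `promise`, and `execute_stable` are hypothetical names introduced only for illustration.

```python
import heapq


class StabilityTracker:
    """Toy model of timestamp stability (an assumption, not Tempo itself)."""

    def __init__(self, replicas):
        # Highest timestamp each replica has promised not to reuse.
        self.watermark = {r: 0 for r in replicas}
        # Min-heap of committed but not yet executed (timestamp, command_id).
        self.pending = []

    def commit(self, timestamp, command_id):
        """Record a command whose timestamp has been agreed upon."""
        heapq.heappush(self.pending, (timestamp, command_id))

    def promise(self, replica, upto):
        """Replica promises it will issue no command with timestamp <= upto."""
        self.watermark[replica] = max(self.watermark[replica], upto)

    def stable_timestamp(self):
        """Largest timestamp below which no new command can appear."""
        return min(self.watermark.values())

    def execute_stable(self):
        """Pop every pending command at or below the stable timestamp,
        in (timestamp, command_id) order."""
        stable = self.stable_timestamp()
        ran = []
        while self.pending and self.pending[0][0] <= stable:
            ran.append(heapq.heappop(self.pending))
        return ran


if __name__ == "__main__":
    tracker = StabilityTracker(["A", "B", "C"])
    tracker.commit(2, "w")
    tracker.commit(1, "v")
    tracker.promise("A", 2)
    tracker.promise("B", 2)
    print(tracker.execute_stable())  # replica C has promised nothing yet
    tracker.promise("C", 3)
    print(tracker.execute_stable())  # both commands are now stable
```

Because every replica applies the same rule and breaks ties by command identifier, all replicas in this toy model execute commands in the same order without any of them acting as a leader.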
