Kite: efficient and available release consistency for the datacenter

Key-Value Stores (KVSs) came into prominence as highly-available, eventually consistent (EC), "NoSQL" databases, but have quickly transformed into general-purpose, programmable storage systems. Thus, EC, while relevant, is no longer sufficient. To meet the emerging requirements for stronger consistency, researchers have proposed KVSs with multiple consistency levels (MCL) that expose the consistency/performance trade-off to the programmer. We argue that this approach falls short in both programmability and performance. For instance, the MCL APIs proposed thus far fail to capture the ordering relationships between strongly- and weakly-consistent accesses that naturally occur in programs. Taking inspiration from shared memory, we advocate Release Consistency (RC) for KVSs. We argue that RC's one-sided barriers are ideal for capturing the ordering relationship between synchronization and non-synchronization accesses while enabling high performance. We present Kite, the first highly-available, replicated KVS that offers a linearizable variant of RC for the asynchronous setting with individual process and network failures. Kite enforces RC barriers through a novel fast/slow path mechanism that leverages the absence of failures in the typical case to maximize performance, while relying on the slow path for progress. Our evaluation shows that the RDMA-enabled and heavily-multithreaded Kite achieves orders of magnitude better performance than Derecho (a state-of-the-art RDMA-enabled state machine replication system) and significantly outperforms ZAB (the protocol at the heart of Zookeeper). We demonstrate the efficacy of Kite by porting three lock-free shared memory data structures, and showing that Kite outperforms the competition.
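
As an illustration of the one-sided barriers that RC borrows from shared memory, the following is a minimal sketch in standard C++, not Kite's API. It assumes an ordinary multithreaded program rather than a replicated KVS; the names `run_once`, `data`, and `ready` are hypothetical and chosen only for this example. The write to `data` is a non-synchronization access, while the store and load on `ready` are the synchronization accesses carrying release and acquire semantics.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper for illustration only: returns the value the consumer observes.
int run_once() {
    int data = 0;                     // non-synchronization (plain) location
    std::atomic<bool> ready{false};   // synchronization variable
    int seen = -1;

    std::thread producer([&] {
        data = 42;  // plain write, ordered before the release below
        // Release: earlier accesses may not be reordered after this store.
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // Acquire: later accesses may not be reordered before this load.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = data;  // the producer's write to data is visible here
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Each barrier constrains reordering in one direction only: accesses after the release and before the acquire remain free to be reordered, which is the property the abstract credits for enabling high performance.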
