HovercRaft: achieving scalability and fault-tolerance for microsecond-scale datacenter services

Cloud platform services must simultaneously be scalable, meet low tail-latency service-level objectives, and be resilient to a combination of software, hardware, and network failures. Replication plays a fundamental role in meeting both the scalability and the fault-tolerance requirements, but is subject to opposing constraints: (1) scalability is typically achieved by relaxing consistency; (2) fault-tolerance is typically achieved through the consistent replication of state machines. Adding nodes to a system can therefore either increase performance at the expense of consistency, or increase resiliency at the expense of performance. We propose HovercRaft, a new approach by which adding nodes increases both the resilience and the performance of general-purpose state-machine replication. We achieve this through an extension of the Raft protocol that carefully eliminates CPU and I/O bottlenecks and load-balances requests. Our implementation uses state-of-the-art kernel-bypass techniques, datacenter transport protocols, and in-network programmability to deliver up to 1 million operations/second for clusters of up to 9 nodes, linear speedup over an unreplicated configuration for selected workloads, and a 4x speedup for the YCSB-E benchmark running on Redis over an unreplicated deployment.
