HovercRaft: achieving scalability and fault-tolerance for microsecond-scale datacenter services

Cloud platform services must simultaneously be scalable, meet low tail latency service-level objectives, and be resilient to a combination of software, hardware, and network failures. Replication plays a fundamental role in meeting both the scalability and the fault-tolerance requirements, but is subject to opposing requirements: (1) scalability is typically achieved by relaxing consistency; (2) fault-tolerance is typically achieved through the consistent replication of state machines. Adding nodes to a system can therefore either increase performance at the expense of consistency, or increase resiliency at the expense of performance. We propose HovercRaft, a new approach by which adding nodes increases both the resilience and the performance of general-purpose state-machine replication. We achieve this through an extension of the Raft protocol that carefully eliminates CPU and I/O bottlenecks and load-balances requests. Our implementation uses state-of-the-art kernel-bypass techniques, datacenter transport protocols, and in-network programmability to deliver up to 1 million operations/second for clusters of up to 9 nodes, linear speedup over an unreplicated configuration for selected workloads, and a 4× speedup for the YCSB-E benchmark running on Redis over an unreplicated deployment.

Marios Kogias | Edouard Bugnion

[1] Junfeng Yang,et al. Paxos made transparent , 2015, SOSP.

[2] Robert Griesemer,et al. Paxos made live: an engineering perspective , 2007, PODC '07.

[3] Jialin Li,et al. Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering , 2016, OSDI.

[4] Edouard Bugnion,et al. Lancet: A self-correcting Latency Measuring Tool , 2019, USENIX Annual Technical Conference.

[5] Adam Silberstein,et al. Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[6] John K. Ousterhout,et al. Homa: a receiver-driven low-latency transport protocol using network priorities , 2018, SIGCOMM.

[7] Edouard Bugnion,et al. R2P2: Making RPCs first-class datacenter citizens , 2019, USENIX ATC.

[8] Yang Wang,et al. All about Eve: Execute-Verify Replication for Multi-Core Servers , 2012, OSDI.

[9] Miguel Castro,et al. FaRM: Fast Remote Memory , 2014, NSDI.

[10] Cheng Wang,et al. APUS: fast and scalable paxos on RDMA , 2017, SoCC.

[11] Arvind Krishnamurthy,et al. Building consistent transactions with inconsistent replication , 2015, SOSP.

[12] Brett D. Fleisch,et al. The Chubby lock service for loosely-coupled distributed systems , 2006, OSDI '06.

[13] Jialin Li,et al. Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control , 2017, SOSP.

[14] Jinyang Li,et al. Consolidating Concurrency Control and Consensus for Commits under Conflicts , 2016, OSDI.

[15] Xiao Liu,et al. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module , 2019, ArXiv.

[16] Ali Ghodsi,et al. Highly Available Transactions: Virtues and Limitations , 2013, Proc. VLDB Endow..

[17] Fred B. Schneider,et al. Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.

[18] Barbara Liskov,et al. Viewstamped Replication: A General Primary Copy , 1988, PODC.

[19] Tony Tung,et al. Scaling Memcache at Facebook , 2013, NSDI.

[20] Phone Lin,et al. A Kubernetes-Based Monitoring Platform for Dynamic Cloud Resource Provisioning , 2017, GLOBECOM 2017 - 2017 IEEE Global Communications Conference.

[21] Timothy Roscoe,et al. Arrakis , 2014, OSDI.

[22] Peter Bailis,et al. The network is reliable , 2014, Commun. ACM.

[23] David G. Andersen,et al. Paxos Quorum Leases: Fast Reads Without Sacrificing Writes , 2014, SoCC.

[24] Fernando Pedone,et al. NetPaxos: consensus at network speed , 2015, SOSR.

[25] Fernando Pedone,et al. Strong Consistency at Scale , 2016, IEEE Data Eng. Bull..

[26] Fernando Pedone,et al. The Case For In-Network Computing On Demand , 2019, EuroSys.

[27] Jialin Li,et al. Designing Distributed Systems Using Approximate Synchrony in Data Center Networks , 2015, NSDI.

[28] Byung-Gon Chun,et al. MegaPipe: A New Programming Interface for Scalable Network I/O , 2012, OSDI.

[29] André Schiper,et al. Lightweight causal and atomic group multicast , 1991, TOCS.

[30] Satoshi Matsushita,et al. Implementing linearizability at large scale and low latency , 2015, SOSP.

[31] Fernando Pedone,et al. Paxos Made Switch-y , 2015, CCRV.

[32] Eunyoung Jeong,et al. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems , 2014, NSDI.

[33] John K. Ousterhout,et al. In Search of an Understandable Consensus Algorithm , 2014, USENIX ATC.

[34] Dong Zhou,et al. Rex: replication at the speed of multi-core , 2014, EuroSys '14.

[35] Michael Kaminsky,et al. Datacenter RPCs can be General and Fast , 2018, NSDI.

[36] André Schiper,et al. Optimistic active replication , 2001, Proceedings 21st International Conference on Distributed Computing Systems.

[37] Xin Jin,et al. Harmonia: Near-Linear Scalability for Replicated Storage with In-Network Conflict Detection , 2019, Proc. VLDB Endow..

[38] Christoforos E. Kozyrakis,et al. Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency , 2019, NSDI.

[39] Robert Ricci,et al. Splinter: Bare-Metal Extensions for Multi-Tenant Low-Latency Storage , 2018, OSDI.

[40] Amar Phanishayee,et al. PLATO: Predictive Latency-Aware Total Ordering , 2006, 2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06).

[41] Leslie Lamport,et al. Fast Paxos , 2006, Distributed Computing.

[42] Sandeep K. Singhal,et al. Log-based receiver-reliable multicast for distributed interactive simulation , 1995, SIGCOMM '95.

[43] Nancy A. Lynch,et al. Impossibility of distributed consensus with one faulty process , 1983, PODS '83.

[44] Eric A. Brewer,et al. Pushing the CAP: Strategies for Consistency and Availability , 2012, Computer.

[45] Jian Yang,et al. Orion: A Distributed File System for Non-Volatile Main Memory and RDMA-Capable Networks , 2019, FAST.

[46] Paulo R. Coelho,et al. Kernel Paxos , 2018, 2018 IEEE 37th Symposium on Reliable Distributed Systems (SRDS).

[47] Christopher Frost,et al. Spanner: Google's Globally-Distributed Database , 2012, OSDI.

[48] Leslie Lamport,et al. Vertical paxos and primary-backup replication , 2009, PODC '09.

[49] Carlos Maltzahn,et al. Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[50] Ju Wang,et al. Windows Azure Storage: a highly available cloud storage service with strong consistency , 2011, SOSP.

[51] Gautam Kumar,et al. pHost: distributed near-optimal datacenter transport over commodity network fabric , 2015, CoNEXT.

[52] Leslie Lamport,et al. The part-time parliament , 1998, TOCS.

[53] Amin Vahdat,et al. Snap: a microkernel approach to host networking , 2019, SOSP.

[54] Henry Qin,et al. Fast key-value stores: An idea whose time has come and gone , 2019, HotOS.

[55] David R. Cheriton,et al. Leases: an efficient fault-tolerant mechanism for distributed file cache consistency , 1989, SOSP '89.

[56] Yawei Li,et al. Megastore: Providing Scalable, Highly Available Storage for Interactive Services , 2011, CIDR.

[57] Sanjay Ghemawat,et al. The Google file system , 2003, SOSP.

[58] Edouard Bugnion,et al. ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks , 2017, SOSP.

[59] Akkihebbal L. Ananda,et al. A survey of remote procedure calls , 1990, OPSR.

[60] Torsten Hoefler,et al. DARE: High-Performance State Machine Replication on RDMA Networks , 2015, HPDC.

[61] Mahadev Konar,et al. ZooKeeper: Wait-free Coordination for Internet-scale Systems , 2010, USENIX ATC.

[62] Xiaozhou Li,et al. NetChain: Scale-Free Sub-RTT Coordination , 2018, NSDI.

[63] Luiz André Barroso,et al. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition , 2013, Synthesis Lectures on Computer Architecture.

[64] Jaeyoung Do,et al. Programmable solid-state storage in future cloud datacenters , 2019, Commun. ACM.

[65] Fernando Pedone,et al. Consensus for Non-volatile Main Memory , 2018, 2018 IEEE 26th International Conference on Network Protocols (ICNP).

[66] Thomas E. Anderson,et al. TAS: TCP Acceleration as an OS Service , 2019, EuroSys.

[67] Jing Liu,et al. I'm Not Dead Yet!: The Role of the Operating System in a Kernel-Bypass Era , 2019, HotOS.

[68] Luiz André Barroso,et al. The tail at scale , 2013, CACM.

[69] Norman May,et al. Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads , 2013, ADMS@VLDB.

[70] Jim Gray,et al. Fault Tolerance in Tandem Systems , 1985, High Performance Transaction Systems Workshop.

[71] Werner Vogels,et al. Dynamo: amazon's highly available key-value store , 2007, SOSP.

[72] Gustavo Alonso,et al. Consensus in a Box: Inexpensive Coordination in Hardware , 2016, NSDI.

[73] Ashish Gupta,et al. The RAMCloud Storage System , 2015, ACM Trans. Comput. Syst..

[74] Fernando Pedone,et al. Quality-Aware Entity-Level Semantic Representations for Short Texts. , 2016 .

[75] Flavio Paiva Junqueira,et al. Zab: High-performance broadcast for primary-backup systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[76] Hari Balakrishnan,et al. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads , 2019, NSDI.

[77] John K. Ousterhout,et al. Exploiting Commutativity For Practical Fast Replication , 2017, NSDI.

[78] Andrew S. Tanenbaum,et al. Group communication in the Amoeba distributed operating system , 1991, [1991] Proceedings. 11th International Conference on Distributed Computing Systems.

[79] M. S. Ali,et al. Reliable Multicast Transport Protocol: RMTP , 2010 .

[80] Keith Marzullo,et al. Mencius: Building Efficient Replicated State Machine for WANs , 2008, OSDI.

[81] David G. Andersen,et al. There is more consensus in Egalitarian parliaments , 2013, SOSP.

[82] Nick McKeown,et al. pFabric: minimal near-optimal datacenter transport , 2013, SIGCOMM.

[83] Mark Handley,et al. Network stack specialization for performance , 2013, HotNets.

[84] Leslie Lamport,et al. Paxos Made Simple , 2001 .

[85] Jinyang Li,et al. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store , 2013, USENIX ATC.