Consensus in a Box: Inexpensive Coordination in Hardware

Consensus mechanisms for ensuring consistency are some of the most expensive operations in managing large amounts of data. Often, there is a trade-off that involves reducing the coordination overhead at the price of accepting possible data loss or inconsistencies. As the demand for more efficient data centers increases, it is important to provide better ways of ensuring consistency without affecting performance. In this paper we show that consensus (atomic broadcast) can be removed from the critical path of performance by moving it to hardware. As a proof of concept, we implement ZooKeeper's atomic broadcast at the network level using an FPGA. Our design uses both TCP and an application-specific network protocol. The design can be used to push more value into the network, e.g., by extending the functionality of middleboxes or adding inexpensive consensus to in-network processing nodes. To illustrate how this hardware consensus can be used in practical systems, we have combined it with a main-memory key-value store running on specialized microservers (also built on FPGAs). This results in a distributed service similar to ZooKeeper that exhibits high and stable performance. This work can be used as a blueprint for further specialized designs.
