ZooKeeper: Wait-free Coordination for Internet-scale Systems

In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, it aims to provide a simple, high-performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper combines the wait-free aspects of shared registers with an event-driven mechanism similar to the cache invalidations of distributed file systems, yielding a simple yet powerful coordination service. The ZooKeeper interface also enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per-client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high-performance processing pipeline in which read requests are satisfied by local servers. We show that for the target workloads, with read-to-write ratios from 2:1 to 100:1, ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used extensively by client applications.
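
The event-driven mechanism mentioned above can be illustrated with a small sketch. This is not ZooKeeper's API or implementation: `ToyStore`, its `get`/`set` methods, and the `invalidate` callback are hypothetical names, and the store is a single in-process dictionary with no replication or ordering guarantees. It only shows the general idea of a client caching a value and registering a one-shot notification that tells it when the cached copy has become stale, so that reads never block on other clients.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


class ToyStore:
    """Hypothetical in-memory stand-in for a coordination service (assumption, not ZooKeeper)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        # Callbacks waiting for the next change to each path.
        self._watches: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def get(self, path: str, watch: Optional[Callable[[str], None]] = None) -> Optional[bytes]:
        """Return the current value without blocking; optionally register a one-shot watch."""
        if watch is not None:
            self._watches[path].append(watch)
        return self._data.get(path)

    def set(self, path: str, value: bytes) -> None:
        """Store a value, then notify and discard every watch registered on this path."""
        self._data[path] = value
        for callback in self._watches.pop(path, []):
            callback(path)


# A client keeps a local cache and drops an entry when notified of a change.
store = ToyStore()
store.set("/config", b"v1")

cache: Dict[str, Optional[bytes]] = {}
events: List[str] = []


def invalidate(path: str) -> None:
    events.append(path)
    cache.pop(path, None)


cache["/config"] = store.get("/config", watch=invalidate)
store.set("/config", b"v2")  # fires the watch once and invalidates the cache entry
store.set("/config", b"v3")  # no watch is registered any more, so nothing fires
```

In this sketch the notification carries only the path, not the new value, so the client must read again (and re-register the watch) to refresh its cache; this mirrors the cache-invalidation analogy in the abstract rather than any specific protocol detail.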
