Kronos: the design and implementation of an event ordering service

This paper proposes a new approach to determining the order of interdependent operations in a distributed system. The key idea behind our approach is to factor the task of tracking happens-before relationships out of the components that make up the system and to centralize it in a separate event ordering service. This not only simplifies the implementation of individual components by freeing them from having to propagate dependence information, but also enables dependence relationships to be maintained across multiple independent systems. A novel API enables the system to detect and take advantage of concurrency whenever possible by maintaining fine-grained information and binding events to a time order as late as possible. We demonstrate the benefits of this approach through several example applications, including a transactional key-value store and an online graph store. Experiments show that our event ordering service scales well and has low overhead in practice.
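
To make the idea of a centralized happens-before tracker concrete, the following is a minimal in-memory sketch. It is an illustration only and not the API described in the paper: the class name `OrderingService` and the methods `create_event`, `assign_order`, and `query_order` are hypothetical, and the sketch assumes a single process with no replication or garbage collection. Events stay unordered (concurrent) until a client explicitly asks for an ordering, which is one simple way to read "binding events to a time order as late as possible".

```python
from collections import defaultdict


class OrderingService:
    """Toy happens-before graph. Hypothetical names; not the paper's API."""

    def __init__(self):
        self._next_id = 0
        # Maps an event to the set of events recorded as happening after it.
        self._succ = defaultdict(set)

    def create_event(self):
        """Return a fresh event id with no ordering constraints."""
        event = self._next_id
        self._next_id += 1
        return event

    def _reaches(self, src, dst):
        """Iterative depth-first search along happens-before edges."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self._succ[node])
        return False

    def assign_order(self, first, second):
        """Record first -> second; refuse if that would create a cycle."""
        if first == second or self._reaches(second, first):
            return False
        self._succ[first].add(second)
        return True

    def query_order(self, a, b):
        """Return 'same', 'before', 'after', or 'concurrent' for a vs. b."""
        if a == b:
            return "same"
        if self._reaches(a, b):
            return "before"
        if self._reaches(b, a):
            return "after"
        return "concurrent"
```

In this sketch, two events that no client has related are reported as concurrent, so an application is free to process them in parallel; once an order is assigned, transitive queries reflect it, and a contradictory assignment is rejected rather than silently accepted.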
