Leveraging sharding in the design of scalable replication protocols

Most, if not all, datacenter services use sharding and replication for scalability and reliability. Shards are typically more or less independent of one another and individually replicated. In this paper, we challenge this design philosophy and present a replication protocol in which shards interact with one another: a protocol running within each shard ensures linearizable consistency, while interaction across shards improves availability. We provide a specification of the protocol, prove its safety, analyze its liveness and availability properties, and evaluate a working implementation.
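The abstract only sketches the architecture, so the following is a minimal, purely illustrative Python sketch of the idea as described: each shard runs its own replication protocol over a versioned configuration, and a neighboring shard, rather than an external master, reconfigures a shard when one of its replicas fails. All names here (Config, Shard, Ring, handle_replica_failure, the replica labels) are hypothetical and not taken from the paper.

    # Illustrative sketch only: shards arranged in a ring, where each shard's
    # configuration (its replica set) is managed by its successor shard.
    # Names are assumptions for this sketch, not the paper's actual protocol.
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class Config:
        """A shard's replica set, tagged with a version so requests issued
        under a stale configuration can be rejected."""
        version: int
        replicas: List[str]

    @dataclass
    class Shard:
        shard_id: int
        config: Config
        store: Dict[str, str] = field(default_factory=dict)

        def write(self, key: str, value: str) -> None:
            # Stand-in for the intra-shard replication protocol that would
            # propagate the write through self.config.replicas in order.
            self.store[key] = value

        def read(self, key: str) -> Optional[str]:
            return self.store.get(key)

    class Ring:
        """Shards interact: shard i's configuration is changed only by
        shard (i+1) mod n, so recovering from a replica failure needs
        no external configuration master."""

        def __init__(self, shards: List[Shard]):
            self.shards = shards

        def successor(self, shard: Shard) -> Shard:
            idx = self.shards.index(shard)
            return self.shards[(idx + 1) % len(self.shards)]

        def handle_replica_failure(self, shard: Shard, failed: str, spare: str) -> None:
            # The successor shard issues the new configuration; bumping the
            # version fences off replicas still running the old config.
            manager = self.successor(shard)
            new_replicas = [spare if r == failed else r for r in shard.config.replicas]
            shard.config = Config(shard.config.version + 1, new_replicas)
            print(f"shard {manager.shard_id} reconfigured shard {shard.shard_id}: "
                  f"v{shard.config.version} {shard.config.replicas}")

    if __name__ == "__main__":
        s0 = Shard(0, Config(1, ["a1", "a2", "a3"]))
        s1 = Shard(1, Config(1, ["b1", "b2", "b3"]))
        ring = Ring([s0, s1])
        s0.write("x", "42")
        ring.handle_replica_failure(s0, failed="a2", spare="a4")
        assert s0.read("x") == "42"

The versioned configuration is the load-bearing detail in this sketch: incrementing the version on every reconfiguration is one standard way to let replicas reject requests issued under a stale replica set, so that consistency is preserved across failures and reconfigurations.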
