Building consistent transactions with inconsistent replication

Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google’s Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use, in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications. We present TAPIR, the Transactional Application Protocol for Inconsistent Replication, the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.
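
The abstract's key idea is a division of labor: replicas are kept only weakly consistent, and the transaction protocol restores strong consistency at commit time with OCC-style validation, which lets a client attempt a commit in a single round of messages to the replicas. The Python fragment below is a minimal, hypothetical sketch of that idea, not the authors' protocol; the `Replica` class, the simple majority vote, and the per-key version check are assumptions made for illustration and omit TAPIR's actual quorum rules and timestamp-based ordering.

```python
# Hypothetical sketch (not TAPIR itself): replicas apply operations without
# coordinating with each other; the transaction layer enforces consistency
# via OCC validation and a single Prepare round driven by the client.
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A replica that validates and applies operations purely locally."""
    store: dict = field(default_factory=dict)  # key -> (value, version)

    def prepare(self, read_set, write_set):
        # OCC read validation only: every key read by the transaction must
        # still be at the version the transaction observed.
        for key, version in read_set.items():
            if self.store.get(key, (None, 0))[1] != version:
                return "ABORT"
        return "PREPARE-OK"

    def commit(self, write_set):
        for key, value in write_set.items():
            _, version = self.store.get(key, (None, 0))
            self.store[key] = (value, version + 1)


def try_commit(replicas, read_set, write_set):
    """One round of Prepare messages, then a decision. A fault-tolerant
    protocol needs a stronger quorum rule; a bare majority is used here
    purely for illustration."""
    votes = [r.prepare(read_set, write_set) for r in replicas]
    if votes.count("PREPARE-OK") > len(replicas) // 2:
        for r in replicas:
            r.commit(write_set)
        return True
    return False


if __name__ == "__main__":
    group = [Replica() for _ in range(3)]
    # Blind write: no reads to validate, so the commit succeeds.
    print(try_commit(group, read_set={}, write_set={"x": "1"}))        # True
    # Stale read of x at version 0 fails OCC validation and aborts.
    print(try_commit(group, read_set={"x": 0}, write_set={"x": "2"}))  # False
```

Because each replica validates independently and never coordinates with its peers, the replication layer stays cheap; disagreements are resolved by the transaction layer (here, simply by aborting), which is the trade-off the paper explores.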
