Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control

Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity. The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas in the normal case. It provides atomicity, consistency, and fault tolerance with less than 10% overhead -- achieving throughput 3.6-35x higher and latency 72-80% lower than a conventional design on standard benchmarks.
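As an illustration only, the idea of a network element that consistently orders transactions across shards can be sketched as follows. The abstract does not describe the mechanism, so everything here is an assumption: the per-shard counters, the `Sequencer` and `Shard` classes, and the gap check are hypothetical names and design choices, not the Eris implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Sequencer:
    """Hypothetical stand-in for the in-network ordering element."""
    counters: dict = field(default_factory=dict)  # assumed: one counter per shard

    def stamp(self, shards):
        # Advance the counter of every destination shard in one step, so all
        # recipients of a transaction agree on its position in their order.
        stamp = {}
        for shard in sorted(shards):
            self.counters[shard] = self.counters.get(shard, 0) + 1
            stamp[shard] = self.counters[shard]
        return stamp


@dataclass
class Shard:
    """Hypothetical shard that applies transactions strictly in stamp order."""
    name: str
    next_seq: int = 1
    log: list = field(default_factory=list)

    def deliver(self, txn_id, stamp):
        # Accept only the next expected sequence number; anything else means
        # a message was dropped or reordered and must be handled separately.
        if stamp[self.name] != self.next_seq:
            return False
        self.log.append(txn_id)
        self.next_seq += 1
        return True


if __name__ == "__main__":
    seq = Sequencer()
    a, b = Shard("A"), Shard("B")
    t1 = seq.stamp({"A", "B"})  # transaction touching both shards
    t2 = seq.stamp({"B"})       # transaction touching only shard B
    for shard in (a, b):
        shard.deliver("t1", t1)
    b.deliver("t2", t2)
    print(a.log, b.log)
```

In this sketch, shards never exchange messages with each other: each one decides locally, from the stamp alone, whether it has seen every transaction that precedes the current one. Handling a detected gap is outside the scope of the sketch.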
