No compromises: distributed transactions with consistency, availability, and performance

Transactions with strong consistency and high availability simplify building and reasoning about distributed systems. However, previous implementations performed poorly. This forced system designers to avoid transactions completely, to weaken consistency guarantees, or to provide single-machine transactions that require programmers to partition their data. In this paper, we show that there is no need to compromise in modern data centers. We show that a main memory distributed computing platform called FaRM can provide distributed transactions with strict serializability, high performance, durability, and high availability. FaRM achieves a peak throughput of 140 million TATP transactions per second on 90 machines with a 4.9 TB database, and it recovers from a failure in less than 50 ms. Key to achieving these results was the design of new transaction, replication, and recovery protocols from first principles to leverage commodity networks with RDMA and a new, inexpensive approach to providing non-volatile DRAM.
