Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!

There is currently an active debate on which RDMA primitive (one-sided or two-sided) is optimal for distributed transactions. This debate has produced a number of optimizations built on one primitive, each shown to outperform the other. In this paper, we perform a systematic comparison of the RDMA primitives, combined with various optimizations, using representative OLTP workloads. More specifically, we first implement and compare the primitives, with both existing optimizations and our new ones, on a single well-tuned execution framework. This gives us insight into the performance characteristics of each primitive. We then investigate the implementation of optimistic concurrency control (OCC), comparing the primitives phase by phase with various transactions from TPC-C, SmallBank, and TPC-E. Our results show that neither primitive (one-sided or two-sided) wins in every phase. We further conduct an end-to-end comparison of prior designs on the same codebase and find that none of them is optimal. Based on these studies, we build DrTM+H, a new hybrid distributed transaction system that always uses the optimal RDMA primitive at each phase of transactional execution. Evaluations using popular OLTP workloads show that DrTM+H achieves over 7.3 and 90.4 million transactions per second for TPC-C and SmallBank, respectively, on a 16-node RDMA-capable cluster (ConnectX-4), without any locality assumption. For TPC-C, this outperforms pure one-sided and two-sided systems by up to 1.89X and 2.96X, with over 49% and 65% latency reduction, respectively. Further, DrTM+H scales well with a large number of connections on modern RDMA networks.
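The argument that picking the better primitive at each phase can beat either pure design can be illustrated with a toy cost model. This is a hypothetical sketch, not DrTM+H's implementation: the phase names (`execution`, `lock`, `validation`, `log`, `commit`), the functions `plan_hybrid` and `pure_cost`, and all cost values are assumptions. It also assumes per-phase costs are additive and independent of the choice made in other phases, which real systems (sharing CPUs, NICs, and caches) need not satisfy.

```python
from typing import Dict, Tuple

# Hypothetical phase split of an OCC transaction (an assumption for this sketch).
PHASES = ("execution", "lock", "validation", "log", "commit")


def plan_hybrid(one_sided: Dict[str, float],
                two_sided: Dict[str, float]) -> Tuple[Dict[str, str], float]:
    """Choose the cheaper primitive for each phase.

    Costs are caller-supplied per-transaction figures (e.g., microseconds
    from a phase-by-phase benchmark). Returns the per-phase plan and the
    total cost under the additive, independent-phase assumption.
    """
    plan: Dict[str, str] = {}
    total = 0.0
    for phase in PHASES:
        a, b = one_sided[phase], two_sided[phase]
        if a <= b:
            plan[phase] = "one-sided"
            total += a
        else:
            plan[phase] = "two-sided"
            total += b
    return plan, total


def pure_cost(costs: Dict[str, float]) -> float:
    """Total cost when every phase uses the same primitive."""
    return sum(costs[p] for p in PHASES)


if __name__ == "__main__":
    # Arbitrary toy values, not measurements from the paper.
    one = {"execution": 2.0, "lock": 3.0, "validation": 1.0, "log": 1.5, "commit": 2.5}
    two = {"execution": 3.0, "lock": 2.0, "validation": 2.0, "log": 2.5, "commit": 1.5}
    plan, hybrid = plan_hybrid(one, two)
    print(plan)
    print("hybrid:", hybrid, "one-sided:", pure_cost(one), "two-sided:", pure_cost(two))
```

Because each phase takes the per-phase minimum, the hybrid total can never exceed either pure total under this model (a sum of minima is at most the minimum of the sums). The toy numbers say nothing about which primitive actually wins in any given phase; that is what the paper's phase-by-phase measurements determine.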
