RFP: When RPC is Faster than Server-Bypass with RDMA

Remote Direct Memory Access (RDMA) has been widely deployed in modern data centers. However, existing usages of RDMA lead to a dilemma between performance and redesign cost. They either directly replace socket-based send/receive primitives with the corresponding RDMA counterpart (server-reply), which only achieves moderate performance improvement; or push performance further by using one-sided RDMA operations to totally bypass the server (server-bypass), at the cost of redesigning the software. In this paper, we introduce two interesting observations about RDMA. First, RDMA has asymmetric performance characteristics, which can be used to improve server-reply's performance. Second, the performance of server-bypass is not as good as expected in many cases, because more rounds of RDMA may be needed if the server is totally bypassed. We therefore introduce a new RDMA paradigm called Remote Fetching Paradigm (RFP). Although RFP requires users to set several parameters to achieve the best performance, it supports the legacy RPC interfaces and hence avoids the need of redesigning application-specific data structures. Moreover, with proper parameters, it can achieve even higher IOPS than that of the previous paradigms. We have designed and implemented an in-memory key-value store based on RFP to evaluate its effectiveness. Experimental results show that RFP improves performance by 1.6×~4× compared with both server-reply and server-bypass paradigms.

[1]  M. Frans Kaashoek,et al.  CPHASH: a cache-partitioned hash table , 2012, PPoPP '12.

[2]  Wencong Xiao,et al.  GraM: scaling graph computation to the trillions , 2015, SoCC.

[3]  Mendel Rosenblum,et al.  Fast crash recovery in RAMCloud , 2011, SOSP.

[4]  Song Jiang,et al.  LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Items , 2015, USENIX Annual Technical Conference.

[5]  Li Zhang,et al.  C-Hint: An Effective and Reliable Cache Management for RDMA-Accelerated Key-Value Stores , 2014, SoCC.

[6]  Xiaozhou Li,et al.  Algorithmic improvements for fast concurrent Cuckoo hashing , 2014, EuroSys '14.

[7]  Dhabaleswar K. Panda,et al.  High performance RDMA-based design of HDFS over InfiniBand , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Haibo Chen,et al.  Fast In-Memory Transaction Processing Using RDMA and HTM , 2017, ACM Trans. Comput. Syst..

[9]  Andrew Birrell,et al.  Implementing remote procedure calls , 1984, TOCS.

[10]  David G. Andersen,et al.  Design Guidelines for High Performance RDMA Systems , 2016, USENIX ATC.

[11]  Pradeep Dubey,et al.  Architecting to achieve a billion requests per second throughput on a single key-value store server platform , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[12]  David G. Andersen,et al.  Using RDMA efficiently for key-value services , 2015, SIGCOMM 2015.

[13]  Song Jiang,et al.  Workload analysis of a large-scale key-value store , 2012, SIGMETRICS '12.

[14]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[15]  Animesh Trivedi,et al.  Wimpy Nodes with 10GbE: Leveraging One-Sided Operations in Soft-RDMA to Boost Memcached , 2012, USENIX ATC.

[16]  Bin Fan,et al.  MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing , 2013, NSDI.

[17]  Yang Zhang,et al.  Extracting More Concurrency from Distributed Transactions , 2014, OSDI.

[18]  Dhabaleswar K. Panda,et al.  High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.

[19]  Dhabaleswar K. Panda,et al.  Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[20]  Miguel Castro,et al.  FaRM: Fast Remote Memory , 2014, NSDI.

[21]  Jinyang Li,et al.  Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store , 2013, USENIX ATC.

[22]  Torsten Hoefler,et al.  DARE: High-Performance State Machine Replication on RDMA Networks , 2015, HPDC.

[23]  Asim Kadav,et al.  MALT: distributed data-parallelism for existing ML applications , 2015, EuroSys.

[24]  Dhabaleswar K. Panda,et al.  PVFS over InfiniBand: design and performance evaluation , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..

[25]  Tony Tung,et al.  Scaling Memcache at Facebook , 2013, NSDI.

[26]  Hyeontaek Lim,et al.  MICA: A Holistic Approach to Fast In-Memory Key-Value Storage , 2014, NSDI.

[27]  David G. Andersen,et al.  FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs , 2016, OSDI.

[28]  Satoshi Matsushita,et al.  Implementing linearizability at large scale and low latency , 2015, SOSP.

[29]  Jinyang Li,et al.  Balancing CPU and Network in the Cell Distributed B-Tree Store , 2016, USENIX ATC.

[30]  Dhabaleswar K. Panda,et al.  High-Performance Design of HBase with RDMA over InfiniBand , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[31]  Sayantan Sur,et al.  Memcached Design on High Performance RDMA Capable Interconnects , 2011, 2011 International Conference on Parallel Processing.