Balancing CPU and Network in the Cell Distributed B-Tree Store

In traditional client-server designs, all requests are processed at the server storing the state, thereby maintaining strict locality between computation and state. The adoption of RDMA (Remote Direct Memory Access) makes it practical to relax locality by letting clients fetch server state and process requests themselves. Such client-side processing improves performance when the server CPU, rather than the network, is the bottleneck. We observe that combining server-side and client-side processing allows systems to balance and adapt to the available CPU and network resources with minimal configuration, and can free resources for other CPU-intensive work. We present Cell, a distributed B-tree store that combines client-side and server-side processing. Cell distributes a global B-tree of "fat" (64 MB) nodes across machines for server-side searches. Within each fat node, Cell organizes keys as a local B-tree of RDMA-friendly small nodes for client-side searches. Cell clients dynamically select whether to use client-side or server-side processing in response to available resources and the current workload. Our evaluation on a large RDMA-capable cluster shows that Cell scales well and that its dynamic selector effectively responds to resource availability and workload properties.
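To make the idea of dynamic selection concrete, the following is a minimal sketch of a per-client selector that keeps a smoothed latency estimate for each search path and sends the next request down the path that currently looks faster. It is an illustration only, not Cell's actual selector: the class names, the moving-average rule, the smoothing weight `alpha`, and the initial estimates are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class PathEstimate:
    """Smoothed latency estimate (microseconds) for one search path (hypothetical)."""
    latency_us: float
    alpha: float = 0.2  # assumed smoothing weight given to each new sample

    def observe(self, sample_us: float) -> None:
        # Exponentially weighted moving average of observed latencies.
        self.latency_us = (1 - self.alpha) * self.latency_us + self.alpha * sample_us


class Selector:
    """Chooses between server-side and client-side search (illustrative only)."""

    def __init__(self, server_us: float, client_us: float) -> None:
        self.paths = {
            "server": PathEstimate(server_us),  # one RPC, uses server CPU
            "client": PathEstimate(client_us),  # several RDMA reads, uses network
        }

    def choose(self) -> str:
        # Pick the path with the lower estimated latency; ties go to the server.
        server = self.paths["server"].latency_us
        client = self.paths["client"].latency_us
        return "server" if server <= client else "client"

    def record(self, path: str, sample_us: float) -> None:
        # Feed a measured request latency back into the estimate for that path.
        self.paths[path].observe(sample_us)


if __name__ == "__main__":
    sel = Selector(server_us=10.0, client_us=30.0)
    print(sel.choose())           # server path looks cheaper at first
    sel.record("server", 200.0)   # server CPU becomes congested
    print(sel.choose())           # estimate has risen, so the client path wins
```

In this sketch a single slow server response is enough to shift traffic to client-side searches; a real system would also need to keep probing the path it is not using so that its estimate does not go stale.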
