SCOR-KV: SIMD-Aware Client-Centric and Optimistic RDMA-Based Key-Value Store for Emerging CPU Architectures

Modern distributed key-value store-based applications rely on bulk-read operations such as Multi-Get (MGet) to accelerate their data-serving phase. State-of-the-art database systems employ SIMD-based techniques to optimize data-parallel operations on in-memory structures such as hash tables, but these techniques have not been adapted to high-performance RDMA-accelerated key-value (KV) stores. In this paper, we present a holistic approach to designing high-performance SIMD-aware KV stores for emerging multi-core CPU architectures. To this end, we first perform an in-depth study of the opportunities and challenges involved in integrating AVX-512 vectorization-based parallel hash table designs with a state-of-the-art high-performance key-value store, RDMA-Memcached. Based on this study, we propose SCOR-KV, a SIMD-Aware Client-Centric and Optimistic RDMA-based Key-Value Store that optimally exploits 'RDMA+SIMD' to accelerate read-heavy MGet operations. SCOR-KV introduces a SIMD-conscious, KV-store-friendly hash table layout that leverages a vertically vectorized N-way cuckoo hash table design with optimistic KV pair lookup schemes. To complement this layout, we propose RDMA-optimized, SIMD-aware MGet communication protocols that offload the server-side pre-/post-processing overheads to the client while enabling optimal end-to-end performance. Our performance evaluations on the latest Intel Skylake CPUs and InfiniBand (IB) EDR interconnects show that SCOR-KV can improve server-side Get throughput by 3.7-8.6x. Through our SIMD-aware RDMA schemes, SCOR-KV can also improve Multi-Get latencies for read-heavy YCSB workloads by about 2.2x compared to the RDMA-Memcached design running over the state-of-the-art CPU-optimized MemC3 hash table design.
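To make the idea of a vertically vectorized N-way cuckoo lookup more concrete, the following is a minimal scalar sketch in Python, not the SCOR-KV implementation. It emulates the lane-wise pattern (hash, gather, compare, masked select) that an AVX-512 version would run with one key per vector lane. The class name `CuckooTable`, the helper `_hash`, the hash multipliers, the `EMPTY` sentinel, the lane count of 16, and the integer-only keys and values are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

LANES = 16   # assumption: sixteen 32-bit keys fit in one 512-bit vector
EMPTY = -1   # hypothetical sentinel; real keys are assumed to be >= 0
_MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35)  # illustrative only


def _hash(key: int, way: int, num_buckets: int) -> int:
    """Hypothetical per-way multiplicative hash, not the paper's function."""
    x = (key * _MULTIPLIERS[way]) & 0xFFFFFFFF
    x ^= x >> 15
    return x % num_buckets


class CuckooTable:
    def __init__(self, num_buckets: int, ways: int = 2, max_kicks: int = 64):
        assert 1 <= ways <= len(_MULTIPLIERS)
        self.num_buckets, self.ways, self.max_kicks = num_buckets, ways, max_kicks
        # One key array and one value array per way.
        self.keys = [[EMPTY] * num_buckets for _ in range(ways)]
        self.vals = [[0] * num_buckets for _ in range(ways)]

    def put(self, key: int, val: int) -> Optional[Tuple[int, int]]:
        """Insert or update. Returns None on success, or the (key, value)
        pair left without a slot once max_kicks displacements are used up."""
        for w in range(self.ways):  # update in place if already present
            b = _hash(key, w, self.num_buckets)
            if self.keys[w][b] == key:
                self.vals[w][b] = val
                return None
        way = 0
        for _ in range(self.max_kicks):
            b = _hash(key, way, self.num_buckets)
            if self.keys[way][b] == EMPTY:
                self.keys[way][b], self.vals[way][b] = key, val
                return None
            # Cuckoo step: take the slot and carry the evicted pair onward.
            key, self.keys[way][b] = self.keys[way][b], key
            val, self.vals[way][b] = self.vals[way][b], val
            way = (way + 1) % self.ways
        return key, val  # a real table would resize or rehash here

    def mget(self, keys: List[int]) -> List[Optional[int]]:
        """Batched lookup: each key in a batch occupies one emulated lane."""
        out: List[Optional[int]] = [None] * len(keys)
        for start in range(0, len(keys), LANES):
            batch = keys[start:start + LANES]
            for w in range(self.ways):
                idx = [_hash(k, w, self.num_buckets) for k in batch]  # vector hash
                got = [self.keys[w][i] for i in idx]                  # gather keys
                hit = [g == k for g, k in zip(got, batch)]            # compare to mask
                for lane, matched in enumerate(hit):                  # masked value gather
                    if matched:
                        out[start + lane] = self.vals[w][idx[lane]]
        return out
```

A caller would insert pairs with `put` and then issue one `mget` over a list of keys; keys that are absent come back as `None`. In an actual SIMD version, each list comprehension in `mget` would correspond to a single vector instruction over all lanes, which is where the data parallelism comes from.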
