Paper Information - SCOR-KV: SIMD-Aware Client-Centric and Optimistic RDMA-Based Key-Value Store for Emerging CPU Architectures


Modern applications built on distributed key-value stores rely on bulk-read operations such as 'Multi-Get' (MGet) to accelerate their data-serving phase. While state-of-the-art database systems employ SIMD-based techniques to optimize data-parallel operations on in-memory structures such as hash tables, these techniques have not been adapted to high-performance RDMA-accelerated key-value (KV) stores. In this paper, we present a holistic approach to designing high-performance SIMD-aware KV stores for emerging multi-core CPU architectures. To this end, we first perform an in-depth study of the opportunities and challenges involved in combining AVX-512 vectorization-based parallel hash table designs with a state-of-the-art high-performance key-value store, RDMA-Memcached. Based on this study, we propose SCOR-KV, a SIMD-Aware Client-Centric and Optimistic RDMA-based Key-Value Store that exploits 'RDMA+SIMD' to accelerate read-heavy MGet operations. SCOR-KV introduces a SIMD-conscious, KV-store-friendly hash table layout that leverages a vertically vectorized N-way cuckoo hash table design with optimistic KV pair lookup schemes. To complement this, we propose RDMA-optimized, SIMD-aware MGet communication protocols that offload the server-side pre-/post-processing overheads to the client while preserving end-to-end performance. Our performance evaluations on the latest Intel Skylake CPUs and IB EDR interconnects show that SCOR-KV achieves a 3.7-8.6x improvement in server-side Get throughput. Through our SIMD-aware RDMA schemes, SCOR-KV also improves Multi-Get latencies for read-heavy YCSB workloads by about 2.2x compared to the RDMA-Memcached design running over the state-of-the-art CPU-optimized MemC3 hash table design.
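The two ideas named in the abstract, vertically vectorized N-way lookup and client-side pre-processing, can be illustrated with a small scalar sketch. This is not the paper's implementation: it is plain Python that emulates SIMD lanes with a loop, and every name in it (`bucket_index`, `client_prepare`, `mget_vertical`, the multipliers, the table sizes) is a hypothetical choice made for illustration. Insertion takes the first free candidate slot and omits cuckoo displacement, and the optimistic concurrency scheme is not modeled.

```python
from typing import List, Optional, Tuple

NUM_WAYS = 2       # "N" in N-way; assumed value for this sketch
NUM_BUCKETS = 8    # buckets per way; toy size
LANES = 16         # one AVX-512 register holds 16 32-bit keys
MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B)  # hypothetical per-way hash constants

Slot = Optional[Tuple[int, str]]


def bucket_index(key: int, way: int) -> int:
    """Illustrative multiplicative hash; one function per way."""
    return (((key * MULTIPLIERS[way]) & 0xFFFFFFFF) >> 16) % NUM_BUCKETS


def new_table() -> List[List[Slot]]:
    return [[None] * NUM_BUCKETS for _ in range(NUM_WAYS)]


def insert(table: List[List[Slot]], key: int, value: str) -> None:
    """Simplified insert: first free candidate slot, no cuckoo displacement."""
    for way in range(NUM_WAYS):
        idx = bucket_index(key, way)
        slot = table[way][idx]
        if slot is None or slot[0] == key:
            table[way][idx] = (key, value)
            return
    raise RuntimeError("all candidate slots occupied; a real table would displace")


def client_prepare(keys: List[int]) -> List[List[int]]:
    """Client-side pre-processing: compute every candidate bucket up front,
    so the server receives ready-to-probe indexes along with the keys."""
    return [[bucket_index(k, w) for w in range(NUM_WAYS)] for k in keys]


def mget_vertical(table: List[List[Slot]], keys: List[int],
                  buckets: List[List[int]]) -> List[Optional[str]]:
    """Vertical lookup: each lane holds a different key. All pending lanes
    probe way 0 together, then only the lanes that missed probe way 1, etc."""
    results: List[Optional[str]] = [None] * len(keys)
    for start in range(0, len(keys), LANES):
        pending = list(range(start, min(start + LANES, len(keys))))
        for way in range(NUM_WAYS):
            missed = []
            for i in pending:  # in real SIMD code: one gather plus one compare
                slot = table[way][buckets[i][way]]
                if slot is not None and slot[0] == keys[i]:
                    results[i] = slot[1]
                else:
                    missed.append(i)
            pending = missed  # plays the role of the lane mask
            if not pending:
                break
    return results


table = new_table()
for k in range(1, 6):
    insert(table, k, f"v{k}")

keys = [1, 3, 99, 5]
values = mget_vertical(table, keys, client_prepare(keys))
print(values)  # ['v1', 'v3', None, 'v5']
```

The inner loop over `pending` stands in for what a vectorized version would do with a single gather and masked compare per way; keeping the per-way bucket indexes in a separate client-computed array is what lets the server skip hashing in this sketch.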
