Size-aware Sharding For Improving Tail Latencies in In-memory Key-value Stores

This paper introduces the concept of size-aware sharding to improve tail latencies for in-memory key-value stores, and describes its implementation in the Minos key-value store. Tail latencies are crucial in distributed applications with high fan-out ratios, because overall response time is determined by the slowest response. Size-aware sharding distributes requests for keys to cores according to the size of the item associated with the key. In particular, requests for small and large items are sent to disjoint subsets of cores. Size-aware sharding improves tail latencies by avoiding head-of-line blocking, in which a request for a small item gets queued behind a request for a large item. Alternative size-unaware approaches to sharding, such as keyhash-based sharding, request dispatching, and stealing, do not avoid head-of-line blocking and therefore exhibit worse tail latencies. The challenge in implementing size-aware sharding is to maintain high throughput by avoiding the cost of software dispatching and by achieving load balancing across cores. Minos uses hardware dispatch for all requests for small items, which constitute the vast majority of all requests. It achieves load balancing by adapting the number of cores handling requests for small and large items to their relative share of the workload. We compare Minos to three state-of-the-art designs of in-memory key-value stores. Compared to its closest competitor, Minos achieves a 99th percentile latency that is up to two orders of magnitude lower. Put differently, for a target 99th percentile latency equal to 10 times the mean service time, Minos achieves a throughput that is up to 7.4 times higher.
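
To make the dispatching idea concrete, the sketch below illustrates size-aware sharding in Go. It is not Minos's implementation: Minos dispatches small requests in hardware via the NIC and adapts the core split at runtime, whereas this sketch uses software queues, a fixed size threshold, and a fixed core split; the threshold value, queue sizes, and all names are illustrative assumptions.

package main

import (
	"fmt"
	"sync"
)

// sizeThreshold separates "small" from "large" items. The actual cutoff
// is workload-dependent; this value is only an illustrative assumption.
const sizeThreshold = 512 // bytes

// request models a GET for a key whose stored item size is known
// (or estimated) at dispatch time.
type request struct {
	key  string
	size int // size in bytes of the item associated with the key
}

// dispatcher sends requests for small and large items to disjoint worker
// pools, so a request for a small item is never queued behind a large one.
type dispatcher struct {
	smallQueue chan request
	largeQueue chan request
}

func newDispatcher(smallWorkers, largeWorkers int, wg *sync.WaitGroup) *dispatcher {
	d := &dispatcher{
		smallQueue: make(chan request, 1024),
		largeQueue: make(chan request, 1024),
	}
	// Workers serving only small items.
	for i := 0; i < smallWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for r := range d.smallQueue {
				fmt.Printf("small-core %d served %s (%d B)\n", id, r.key, r.size)
			}
		}(i)
	}
	// Workers serving only large items.
	for i := 0; i < largeWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for r := range d.largeQueue {
				fmt.Printf("large-core %d served %s (%d B)\n", id, r.key, r.size)
			}
		}(i)
	}
	return d
}

// dispatch is the size-aware step: the routing decision depends only on
// the item size, not on the key hash.
func (d *dispatcher) dispatch(r request) {
	if r.size <= sizeThreshold {
		d.smallQueue <- r
	} else {
		d.largeQueue <- r
	}
}

func main() {
	var wg sync.WaitGroup
	d := newDispatcher(3, 1, &wg) // e.g. 3 cores for small items, 1 for large
	d.dispatch(request{key: "user:42", size: 128})
	d.dispatch(request{key: "blob:7", size: 64 * 1024})
	d.dispatch(request{key: "user:99", size: 256})
	close(d.smallQueue)
	close(d.largeQueue)
	wg.Wait()
}

The property the sketch preserves is that small and large requests are served by disjoint worker pools, so a small request never waits behind a large one; keyhash-based sharding, by contrast, would route both to whichever core owns the key, exposing small requests to head-of-line blocking.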
