Systems and Algorithms for High-Performance, Cost-Efficient Key-Value Storage

Key-value storage systems are increasingly essential building blocks of modern cloud and big data applications. The workloads these systems support often require random access to small objects over massive datasets with highly skewed and dynamic key popularity. It is challenging for a storage cluster to serve these workloads with both high performance and low operating cost, and today’s systems usually sacrifice one for the other. In this dissertation, we present novel approaches to improve both the performance and cost-efficiency of key-value systems by combining new hardware and software techniques with careful architectural design and algorithmic optimizations. First, at cluster scale, we build SwitchKV, a heterogeneous system that uses small high-end cache nodes to guarantee load balancing across many SSD-based backend nodes under nearly arbitrary workloads. The cache nodes absorb the hottest queries so that no individual backend node is overburdened or underutilized. The system exploits OpenFlow switches to enable efficient content-aware routing so that it can achieve scalable high throughput, low tail latency, and high availability. It uses new algorithms to keep the cache and switch forwarding rules updated with low overhead, and to ensure stable high performance under rapidly changing workloads. SwitchKV can meet the service level objectives for many cloud services more efficiently than traditional systems. Second, to improve the efficiency of each individual multi-core server, we build a high-throughput and memory-efficient concurrent hash table based on optimistic cuckoo hashing. Our redesign minimizes critical section length, reduces interprocessor coherence traffic, and enables effective prefetching through careful algorithm and data structure engineering. We explore hardware transactional memory and fine-grained locking for concurrency control, and find that both require the same level of algorithmic effort to achieve high performance. Our new hash table design greatly outperforms other optimized concurrent hash tables for both read- and write-heavy workloads, even while using substantially less memory for small key-value items.
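
As background for the hash table described above, the following is a minimal, single-threaded sketch of plain cuckoo hashing. It is not the dissertation's concurrent design: it has no optimistic versioning, locking, or transactional memory, and it displaces items greedily. The class name `CuckooTable`, its parameters, the BLAKE2-based hash, and the doubling policy are all assumptions made for illustration. The sketch shows only the core idea that every key has two candidate buckets, so a lookup inspects at most two buckets and an insert may relocate existing items to their alternate bucket.

```python
import hashlib


class CuckooTable:
    """Illustrative single-threaded cuckoo hash table (hypothetical names)."""

    def __init__(self, num_buckets=8, slots_per_bucket=4, max_kicks=64):
        self.num_buckets = num_buckets
        self.slots = slots_per_bucket
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _candidates(self, key):
        # Assumption: derive two bucket indices from one 8-byte digest of repr(key).
        digest = hashlib.blake2b(repr(key).encode(), digest_size=8).digest()
        b1 = int.from_bytes(digest[:4], "little") % self.num_buckets
        b2 = int.from_bytes(digest[4:], "little") % self.num_buckets
        return b1, b2

    def get(self, key, default=None):
        # A lookup inspects at most two buckets.
        for b in self._candidates(key):
            for k, v in self.buckets[b]:
                if k == key:
                    return v
        return default

    def put(self, key, value):
        b1, b2 = self._candidates(key)
        # Overwrite the value if the key is already present.
        for b in (b1, b2):
            bucket = self.buckets[b]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
        # Otherwise use a free slot in either candidate bucket.
        for b in (b1, b2):
            if len(self.buckets[b]) < self.slots:
                self.buckets[b].append((key, value))
                return
        # Both buckets are full: evict the oldest resident and move it
        # to its alternate bucket, repeating up to max_kicks times.
        item, b = (key, value), b1
        for _ in range(self.max_kicks):
            bucket = self.buckets[b]
            victim = bucket.pop(0)
            bucket.append(item)
            item = victim
            c1, c2 = self._candidates(item[0])
            b = c2 if b == c1 else c1
            if len(self.buckets[b]) < self.slots:
                self.buckets[b].append(item)
                return
        # Displacement chain too long: grow the table and reinsert everything.
        self._grow(item)

    def _grow(self, pending):
        items = [kv for bucket in self.buckets for kv in bucket] + [pending]
        self.num_buckets *= 2
        self.buckets = [[] for _ in range(self.num_buckets)]
        for k, v in items:
            self.put(k, v)
```

For example, `t = CuckooTable(num_buckets=4, slots_per_bucket=2)` followed by a few hundred `t.put(key, value)` calls forces both displacement and growth, while `t.get(key)` still checks only the key's two candidate buckets. A concurrent design has to make such displacements safe for simultaneous readers and writers, which is where the critical-section and coherence-traffic concerns mentioned above arise.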
