NetRS: Cutting Response Latency in Distributed Key-Value Stores with In-Network Replica Selection

In distributed key-value stores, performance fluctuates across servers, especially when the servers are deployed in a cloud environment. Hence, the replica selected for a request directly affects the response latency. Even the state-of-the-art replica selection algorithm for key-value stores leaves considerable room for reducing response latency. In this paper, we identify the fundamental factors that prevent replica selection algorithms from being effective. We address these factors by proposing NetRS, a framework that enables in-network replica selection for key-value stores. NetRS exploits emerging network devices, including programmable switches and network accelerators, to select replicas for requests. NetRS supports diverse replica selection algorithms and is suited to the network topology of modern data centers. In our extensive evaluations, NetRS effectively cuts response latency compared with the conventional scheme in which clients select replicas for requests. Specifically, NetRS reduces the average latency by up to 48.4% and the 99th-percentile latency by up to 68.7%.
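To illustrate why the location of replica selection can matter, the following is a minimal sketch and not the NetRS algorithm itself. It assumes a toy least-outstanding-requests policy, and the names `ReplicaSelector` and `max_load` are hypothetical. It compares four clients that each select a replica using only their own local view with a single shared selector that sees all requests, as a selector placed in the network might.

```python
from collections import defaultdict


class ReplicaSelector:
    """Toy selector (hypothetical): picks the replica with the fewest
    outstanding requests that this selector itself has issued."""

    def __init__(self):
        self.outstanding = defaultdict(int)

    def select(self, replicas):
        # min() returns the first minimal element, so ties go to the
        # earliest replica in the list.
        choice = min(replicas, key=lambda s: self.outstanding[s])
        self.outstanding[choice] += 1
        return choice

    def complete(self, server):
        # Called when a response from `server` has been observed.
        self.outstanding[server] -= 1


def max_load(selectors, requests):
    """Route each (selector_index, replicas) request and return the
    largest number of requests sent to any single server."""
    load = defaultdict(int)
    for idx, replicas in requests:
        server = selectors[idx].select(replicas)
        load[server] += 1
    return max(load.values())


replicas = ["s1", "s2", "s3"]

# Four clients, one request each, each client with its own local view.
per_client = max_load(
    [ReplicaSelector() for _ in range(4)],
    [(c, replicas) for c in range(4)],
)

# The same four requests routed through one shared selector.
shared = max_load([ReplicaSelector()], [(0, replicas) for _ in range(4)])

print(per_client, shared)
```

In this toy setting, every client independently picks the same replica because none of them can see the others' requests, while the shared selector spreads the requests across replicas. Real systems also have to account for service-time variation and stale feedback, which this sketch ignores.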
