Applying Software-based Memory Error Correction for In-Memory Key-Value Store: Case Studies on Memcached and RAMCloud

Because it is memory hungry, an in-memory key-value store incurs very high memory cost and energy consumption. Intuitively, strong memory error correction at sufficiently small redundancy overhead could be leveraged to reduce memory cost and/or energy consumption. Current computing systems, however, handle memory error correction solely in the hardware stack, with very weak error correction strength. This paper studies, for the first time, the practical feasibility of implementing a strong memory error correction code (ECC) in the software stack for in-memory key-value stores without incurring a significant speed penalty. This is enabled by the low memory bandwidth utilization and relatively simple data structures of in-memory key-value stores, properties that are shared with many other datacenter applications (e.g., Web search). The paper presents several design techniques to optimize software-based ECC implementation for in-memory key-value stores and elaborates on several important design issues. Using Memcached and RAMCloud as test vehicles, this work shows that the proposed design can improve memory error correction strength by several orders of magnitude at similar (or even lower) coding redundancy compared with current hardware-based design practice, while incurring less than 6% degradation in key-value store operational throughput.
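The trade-off between coding redundancy and correction strength can be illustrated with a rough calculation. The abstract does not name the code used, so the sketch below is only an assumption-laden illustration: it takes the common hardware SEC-DED layout (8 check bits per 64 data bits, i.e. 12.5% redundancy) as the baseline and asks how many bit errors a binary BCH code could correct in a 4096-bit block within the same redundancy budget, using the standard upper bound of m·t parity bits for a t-error-correcting BCH code over GF(2^m). The block size and the function names `bch_parity_bits` and `max_t_within_budget` are hypothetical choices for this example, not taken from the paper.

```python
def bch_parity_bits(data_bits: int, t: int) -> int:
    """Upper bound (m * t) on parity bits for a binary BCH code that
    corrects t bit errors and carries data_bits of payload.
    m is the smallest field degree whose code length 2^m - 1 fits
    payload plus parity."""
    m = 1
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    return m * t


def max_t_within_budget(data_bits: int, budget_bits: int) -> int:
    """Largest t whose parity bound fits in budget_bits."""
    t = 0
    while bch_parity_bits(data_bits, t + 1) <= budget_bits:
        t += 1
    return t


# Baseline (assumed): hardware SEC-DED, 8 check bits per 64-bit word.
hw_redundancy = 8 / 64

# Illustrative software block: 4096 data bits, same 12.5% budget.
block_bits = 4096
budget = int(block_bits * hw_redundancy)
t = max_t_within_budget(block_bits, budget)
sw_redundancy = bch_parity_bits(block_bits, t) / block_bits

print(f"hardware SEC-DED redundancy: {hw_redundancy:.3%}")
print(f"BCH over {block_bits} bits: corrects up to t={t} errors "
      f"at {sw_redundancy:.3%} redundancy")
```

Under these assumptions, the longer codeword is what buys the extra strength: the same fraction of redundant bits is pooled across a much larger block instead of being spent one 64-bit word at a time.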
