Error Correction and Detection for Computing Memories Using System Side Information

Error correction and detection are core components of all modern memory systems. Current computing memory systems use simple coding schemes to meet resiliency and latency requirements simultaneously. In this paper, we review our recent results on context-aware coding for computing memories, an approach that explicitly takes into account various forms of intrinsic side information to improve robustness to faults. We discuss both error correction and error detection, describe the theoretical properties of the codes, and provide examples of how these solutions can be implemented in practice. We explicitly describe the special case of error localization codes. We also discuss promising future directions and connections with classical information-theoretic concepts.
