Adaptive Reliability Chipkill Correct (ARCC)

Chipkill correct is an advanced form of memory error correction that is widely used in servers. Large field studies of memory have shown that chipkill correct reduces the uncorrectable error rate by 4X [11] to 36X [12] compared to SECDED. Currently, there is a strong trade-off between power and reliability among chipkill correct solutions. For example, commercially available chipkill correct solutions that can detect up to two failed devices and correct one (e.g., SCCDCD) or two (e.g., Double Chip Sparing) failed devices require accessing 36 DRAM devices per memory request. In contrast, a weaker single chipkill correct, single chipkill detect solution requires accessing only 18 devices per memory request and therefore consumes much less memory power. In this paper, we present Adaptive Reliability Chipkill Correct (ARCC), an optimization that can be applied to existing chipkill correct solutions so that they incur the low power consumption of a weaker chipkill correct solution while maintaining reliability similar to that of a stronger one. ARCC is based on the observation that, on average, only a tiny fraction of memory experiences any type of fault during the typical operational lifespan of a server. Accordingly, ARCC relaxes the strength of chipkill correct initially and then adaptively increases it as needed on a page-by-page basis, reaping the benefit of lower power consumption for the majority of the memory system's lifetime. Our evaluation shows that ARCC reduces memory power consumption by 36%, on average, when applied to commercial SCCDCD, while keeping the storage overhead the same and maintaining similar reliability.
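The page-by-page adaptation described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' implementation: the class `AdaptivePageTable` and its methods are hypothetical names, the abstract does not say how faults are detected or how a page is upgraded, and the number of devices accessed is used here only as a rough proxy for power. The two constants (18 and 36 devices per request) are the figures quoted in the abstract.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set

RELAXED_DEVICES = 18  # devices per request in the weaker mode (from the abstract)
STRONG_DEVICES = 36   # devices per request in the stronger mode (from the abstract)


@dataclass
class AdaptivePageTable:
    """Hypothetical toy model: every page starts in the relaxed mode and is
    upgraded to the strong mode once a fault is reported for it."""
    num_pages: int
    upgraded: Set[int] = field(default_factory=set)

    def report_fault(self, page: int) -> None:
        # Assumption: a fault observed in a page triggers an upgrade of
        # that page only; all other pages stay in the relaxed mode.
        if not 0 <= page < self.num_pages:
            raise ValueError(f"page {page} out of range")
        self.upgraded.add(page)

    def devices_for_access(self, page: int) -> int:
        # Upgraded pages pay the cost of the stronger mode.
        return STRONG_DEVICES if page in self.upgraded else RELAXED_DEVICES

    def mean_devices_per_access(self, accesses: Iterable[int]) -> float:
        # Average number of devices touched over a trace of page accesses.
        trace = list(accesses)
        if not trace:
            raise ValueError("access trace is empty")
        return sum(self.devices_for_access(p) for p in trace) / len(trace)


if __name__ == "__main__":
    table = AdaptivePageTable(num_pages=1000)
    table.report_fault(7)  # one faulty page out of 1000
    # Uniform trace touching every page once.
    print(table.mean_devices_per_access(range(1000)))
```

With one upgraded page out of 1000 and a uniform access trace, the average stays close to 18 devices per request, which mirrors the observation that only a tiny fraction of memory ever needs the stronger mode. The 36% power reduction reported in the abstract comes from the authors' evaluation and is not derived from this model.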