Predicting Single Event Effects in DRAM

The ability to use commodity memory in harsh radiation environments has the potential to advance computing capability for aerospace and nuclear applications, among others. In this work, we provide the first demonstration that DDR3 memory exposed to radiation contains a small number of cells that are weak with respect to single event effects. Consequently, a high proportion of single event faults are not entirely random and can be predicted with high accuracy. We also classify single event effects into predictable single-cell faults, unpredictable single-cell faults, and correlated multi-cell persistent faults, the last of which are due to latch-up effects. We further show that this classification lets us partition faults, which enables a holistic framework for enhanced protection of DRAM. The framework combines a fault map with bit sparing, which protects against faults from weak cells, with Chipkill ECC, which corrects chip-level and random errors. This protection provides a potential path to using commodity DRAM in high-radiation environments with extremely low fault rates. Our results, based on data from a multi-day radiation beam experiment, indicate that the uncorrectable bit error rate for rows containing a weak cell is reduced by a factor of $\geq 10^{7}$ compared to Chipkill alone.
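
The fault partitioning described above can be illustrated with a minimal sketch. It is not the procedure used in this work: the event format (an observation epoch paired with a cell identifier), the function names, and both thresholds (`weak_threshold`, `burst_size`) are hypothetical choices made only for illustration. The sketch labels cells that fail together in one epoch as correlated multi-cell faults, cells that fail in several separate epochs as predictable (weak) cells, and the rest as unpredictable, then collects the weak cells into a fault map that a bit-sparing scheme could consult.

```python
from collections import defaultdict


def classify_faults(events, weak_threshold=2, burst_size=4):
    """Label each faulty cell with one of three illustrative classes.

    events: iterable of (epoch, cell) pairs; both fields are assumed formats.
    weak_threshold, burst_size: hypothetical thresholds, not from the paper.
    """
    cells_by_epoch = defaultdict(set)
    epochs_by_cell = defaultdict(set)
    for epoch, cell in events:
        cells_by_epoch[epoch].add(cell)
        epochs_by_cell[cell].add(epoch)

    # Epochs in which many cells fail at once are treated as correlated events.
    burst_epochs = {e for e, cells in cells_by_epoch.items()
                    if len(cells) >= burst_size}

    labels = {}
    for cell, epochs in epochs_by_cell.items():
        if epochs & burst_epochs:
            labels[cell] = "correlated_multi_cell"
        elif len(epochs) >= weak_threshold:
            # Repeated failures of the same cell suggest a weak cell.
            labels[cell] = "predictable_single_cell"
        else:
            labels[cell] = "unpredictable_single_cell"
    return labels


def build_fault_map(labels):
    """Return the set of cells that bit sparing would cover in this sketch."""
    return {cell for cell, label in labels.items()
            if label == "predictable_single_cell"}
```

In this sketch only the predictable cells enter the fault map. The unpredictable and correlated faults are left to an error-correcting code such as Chipkill, mirroring the division of labor described in the abstract.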
