Predicting and mitigating single-event upsets in DRAM using HOTH

Abstract: There is a growing demand for commodity memory and storage solutions that make commercial aerospace ventures economically feasible. Existing radiation-hardened computer systems cannot meet this need alone. These hardened systems provide sufficient protection against the harsh environment of the upper atmosphere and low-Earth orbit, but they cost dramatically more and rely on commercially out-of-date architectures and fabrication technologies. If new aerospace systems can take advantage of the latest commodity memories, they can leverage advanced fabrication processes and economies of scale to control costs. Such systems, however, require new strategies to maintain appropriate tolerance and/or resilience to faults caused by the harsh environment. In this work, we observe that single-event effects (SEEs) in recent-generation DRAM are not entirely random and are in fact often highly predictable under neutron bombardment. We demonstrate the existence of a small number of weak cells that are responsible for the vast majority of single-bit SEEs. Based on this observation, we present a memory fault mapping and tolerance approach called HOTH that mitigates these predictable fault modes together with the more random, unpredictable SEEs in DDR3 memory. In HOTH, both single- and multi-bit effects can be mitigated individually at runtime by combining the existing error-correcting code techniques in Chipkill ECC with a fault map framework. The HOTH fault map is stored in the same DRAM that is subject to SEEs, and it uses a fault-tolerance approach to mitigate SEEs that might appear in that part of the storage. Using data from different memory DIMMs, form factors, and radiation incidence angles, we show that HOTH can reduce the uncorrectable fault rate by at least ten orders of magnitude and increase mean-time-to-failure to thousands of years, allowing extended service times in harsh environments.
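The weak-cell observation can be illustrated with a minimal sketch. It is not the HOTH implementation: the function name `build_fault_map`, the `(bank, row, column)` cell addressing, the 90% coverage threshold, and the upset log are all assumptions made for illustration, and the log is synthetic rather than measured data. The sketch counts how often each cell appears in a log of single-bit upsets and keeps the smallest set of cells that accounts for a chosen fraction of all events.

```python
from collections import Counter

def build_fault_map(upsets, coverage=0.9):
    """Return the smallest set of cells covering at least `coverage` of the upsets.

    `upsets` is a list of hypothetical (bank, row, column) tuples, one per
    observed single-bit upset. Cells are taken in order of decreasing upset
    count, so the result is the minimum number of cells reaching the target.
    """
    if not upsets:
        return set()
    counts = Counter(upsets)
    target = coverage * len(upsets)
    fault_map, covered = set(), 0
    for cell, n in counts.most_common():
        if covered >= target:
            break
        fault_map.add(cell)
        covered += n
    return fault_map

# Synthetic log (illustrative only): two cells produce 9 of the 10 upsets.
log = [(0, 17, 3)] * 6 + [(1, 4, 60)] * 3 + [(2, 9, 12)]
weak_cells = build_fault_map(log, coverage=0.9)
print(sorted(weak_cells))
```

In a scheme of this kind, cells listed in the map would be handled by the fault map, leaving the ECC budget for the remaining, less predictable upsets.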
