Tuning Memory Fault Tolerance on the Edge

Error correction and fault tolerance have become pivotal considerations as conventional memories scale and emerging memories come to market. The common thread in these reliability challenges is that deep scaling exposes outlier cells in the memory system, and these outliers are responsible for the vast majority of faults. Such cells, whose weakness may be attributed to process variation or undetected fabrication defects, tend to be more vulnerable to various forms of crosstalk, read and write disturbance, and even radiation-induced faults. By tracking faults in memory cells, identifying the worst offenders, and mitigating their effects accordingly, we can design dramatically improved fault tolerance techniques that are tuned to the fault characteristics of the memory at hand. A critical piece is the development of scalable, fault-tolerant registries that track and retain critical information about these faults. The fault registries must be able to function in the faulty memory they protect, operate efficiently at the cell/bit level, and handle extreme fault rates. Using the knowledge of faulty locations, our fault tolerance techniques, applied to conventional main memories like DRAM and to endurance-limited memories like flash and phase-change memory, improve reliability, endurance, and lifetime by orders of magnitude while maintaining performance and energy efficiency.
