SuDoku: Tolerating High-Rate of Transient Failures for Enabling Scalable STTRAM

Conventionally, systems have relied on technology scaling to provide smaller cells, which helps increase the capacity of on-chip and off-chip structures. Unfortunately, scaling technology to smaller nodes increases susceptibility to faults. We study the problem of efficiently tolerating transient failures using scalable Spin-Transfer Torque RAM (STTRAM) as an example. At smaller feature sizes, the energy required to flip an STTRAM cell decreases, which makes these cells more susceptible to random failures caused by thermal noise. Such failures can be tolerated by periodically scrubbing the cache and provisioning each line with an Error Correction Code (ECC). However, to tolerate the desired bit-error rate, the cache needs ECC-6 (six-bit error correction) per line, incurring impractical storage overheads. Ideally, we want to tolerate these faults without relying on multi-bit ECC. We propose SuDoku, a design that provisions each line with ECC-1 and a strong error detection code, and relies on a region-based RAID-4 to correct multi-bit errors. Unfortunately, such a RAID-4 based architecture is, on its own, ineffective at tolerating a high rate of transient faults and provides an MTTF on the order of only a few seconds. We describe a novel data resurrection scheme that can repair multiple faulty lines in a RAID-4 region to increase the MTTF to several hours. We also propose an extension of SuDoku that hashes a given line into two RAID-4 regions to significantly enhance reliability and increase the MTTF to trillions of hours. Our evaluations show that SuDoku provides 874x higher reliability than ECC-6, incurs 30% less storage than ECC-6, and performs within 0.1% of an ideal fault-free baseline.
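
To make the region-based RAID-4 idea concrete, the following is a minimal Python sketch of the baseline mechanism only: one XOR parity line per region, plus a per-line detection code that identifies which line is faulty. It is not the data resurrection scheme or the two-region hashing extension, whose details the abstract does not give. The 64-byte line size, the use of CRC-32 as the "strong error detection code", and all function names are assumptions made for illustration; per-line ECC-1 is omitted.

```python
import zlib
from functools import reduce

LINE_BYTES = 64  # assumed cache-line size (not stated in the abstract)


def xor_lines(lines):
    """Bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*lines))


def make_region(data_lines):
    """Build the RAID-4 parity line and per-line detection codes for a region.

    CRC-32 stands in for the strong detection code; this is an assumption.
    """
    parity = xor_lines(data_lines)
    crcs = [zlib.crc32(line) for line in data_lines]
    return parity, crcs


def recover_region(stored_lines, parity, crcs):
    """Repair a region in which at most one line fails its detection check.

    Plain RAID-4 can rebuild only one faulty line per region, which is why
    a high transient-fault rate overwhelms it without further mechanisms.
    """
    bad = [i for i, line in enumerate(stored_lines)
           if zlib.crc32(line) != crcs[i]]
    if not bad:
        return list(stored_lines)
    if len(bad) > 1:
        raise ValueError("more than one faulty line: plain RAID-4 cannot repair")
    idx = bad[0]
    healthy = [line for i, line in enumerate(stored_lines) if i != idx]
    repaired = xor_lines(healthy + [parity])  # XOR of the rest restores the line
    result = list(stored_lines)
    result[idx] = repaired
    return result
```

In this sketch, a multi-bit error confined to one line is repaired regardless of how many bits flipped, because the whole line is rebuilt from the parity line and the remaining lines. Two faulty lines in the same region raise an error, which corresponds to the failure case that the abstract says limits plain RAID-4 to an MTTF of a few seconds.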
