Multi-layer memory resiliency

With memories continuing to dominate the area, power, cost and performance of a design, there is a critical need to provision reliable, high-performance memory bandwidth for emerging applications. Memories are susceptible to degradation and failures from a wide range of manufacturing, operational and environmental effects, and the memory hierarchy as a whole is highly vulnerable to variability and operational stress. This calls for a multi-layer hardware/software approach that can tolerate, adapt to, and even opportunistically exploit such effects. After reviewing the major memory degradation and failure modes, this paper describes the challenges for dependability across the memory hierarchy and outlines research efforts to achieve multi-layer memory resilience using a hardware/software approach. Two exemplars illustrate multi-layer memory resilience: first, we describe static and dynamic policies that achieve energy savings in caches by combining aggressive voltage scaling with the disabling of faulty blocks; second, we show how software characteristics can be exposed to the architecture in order to mitigate the aging of large register files in GPGPUs. These approaches can further benefit from semantic retention of application intent to enhance memory dependability across multiple abstraction levels, including applications, compilers, run-time systems, and hardware platforms.
