Low-Cost Memory Fault Tolerance for IoT Devices

IoT devices need reliable hardware at low cost, yet it is challenging to cope efficiently with both hard and soft faults in embedded scratchpad memories. To address this problem, we propose a two-step approach: FaultLink and Software-Defined Error-Localizing Codes (SDELC). FaultLink avoids hard faults found during testing by generating a custom-tailored application binary image for each individual chip. At software deployment time, FaultLink optimally packs small sections of program code and data into fault-free segments of the memory address space and generates a custom linker script for a lazy-linking procedure. At run time, SDELC deals with unpredictable soft faults via novel and inexpensive Ultra-Lightweight Error-Localizing Codes (UL-ELCs). UL-ELCs require fewer parity bits than single-error-correcting Hamming codes, yet they are more powerful than basic single-error-detecting parity: they localize single-bit errors to a specific chunk of a codeword. SDELC then heuristically recovers from these localized errors using a small embedded C library that exploits observable side information (SI) about the application’s memory contents. SI can take the form of redundant data (value locality), legal/illegal instructions, etc. Our combined FaultLink+SDELC approach improves min-VDD by up to 440 mV and, with just three parity bits per 32-bit word, correctly recovers from up to 90% of random single-bit soft faults in data and up to 70% in instructions.
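
To make the error-localization idea concrete, the following is a minimal sketch, not the construction or the C library described in the abstract. It assumes a hypothetical layout in which a 32-bit word is split into four 8-bit chunks, each chunk is tagged with a distinct 3-bit syndrome, and the remaining nonzero syndromes indicate a flipped parity bit. The recovery policy (pick the candidate closest to a neighboring word, as a crude stand-in for value locality) and all names such as `CHUNK_SYNDROMES`, `localize`, and `recover` are illustrative assumptions.

```python
# Hypothetical layout (an assumption for illustration): four 8-bit data
# chunks, each tagged with a distinct 3-bit syndrome of weight >= 2.
# The three weight-1 syndromes then identify an error in a parity bit.
CHUNK_SYNDROMES = (0b011, 0b101, 0b110, 0b111)  # chunk 0 covers bits 0..7


def parity_bits(word):
    """Compute three parity bits for a 32-bit word."""
    parity = 0
    for chunk, syndrome in enumerate(CHUNK_SYNDROMES):
        byte = (word >> (8 * chunk)) & 0xFF
        if bin(byte).count("1") % 2:  # odd chunk parity toggles its syndrome
            parity ^= syndrome
    return parity


def localize(word, stored_parity):
    """Return None (no error seen), 'parity', or the suspect chunk index."""
    syndrome = parity_bits(word) ^ stored_parity
    if syndrome == 0:
        return None
    if syndrome in CHUNK_SYNDROMES:
        return CHUNK_SYNDROMES.index(syndrome)
    return "parity"


def candidates(word, chunk):
    """All words reachable by flipping one bit inside the suspect chunk."""
    return [word ^ (1 << (8 * chunk + bit)) for bit in range(8)]


def hamming(a, b):
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")


def recover(word, stored_parity, neighbors):
    """Heuristic recovery: pick the candidate nearest to any neighbor word."""
    where = localize(word, stored_parity)
    if where is None or where == "parity":
        return word  # data bits are presumed intact
    return min(
        candidates(word, where),
        key=lambda c: min(hamming(c, n) for n in neighbors),
    )


original = 0x12345678
stored = parity_bits(original)
corrupted = original ^ (1 << 13)  # single-bit soft fault in chunk 1
repaired = recover(corrupted, stored, neighbors=[0x12345678, 0x12345600])
```

Under this layout a single-bit error narrows the search from 32 candidate bit positions to 8, and the side information only has to break the tie among those 8. The heuristic can still choose the wrong candidate when the side information is weak, which is consistent with the abstract reporting recovery rates below 100%.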
