(Invited) Cross-Layer Resilience: Challenges, Insights, and the Road Ahead

Resilience to errors in the underlying hardware is a key design objective for a large class of computing systems, from embedded systems all the way to the cloud. Sources of hardware errors include radiation, circuit aging, variability induced by manufacturing and operating conditions, manufacturing test escapes, and early-life failures. Many publications have suggested that cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate, is essential for designing cost-effective resilient digital systems. This paper presents a comprehensive overview of cross-layer resilience: it addresses fundamental cross-layer resilience questions, summarizes insights derived from recent advances in cross-layer resilience research, and discusses future cross-layer resilience challenges.

CCS CONCEPTS • General and reference $\rightarrow$ Reliability; • Hardware $\rightarrow$ Fault tolerance; • Computer systems organization $\rightarrow$ Reliability
