Demonstrating HW–SW Transient Error Mitigation on the Single-Chip Cloud Computer Data Plane

Transient errors are a major concern for the correct operation of low-level cache memories. Aggressive integration demands effective mitigation of such errors without extreme overheads in power, timing, or silicon area. We demonstrate a hybrid hardware-software scheme that mitigates bit flips in data residing in low-level caches. The methodology is applicable to streaming applications, which we illustrate with a video decoding case study on a state-of-the-art many-core chip: the Single-Chip Cloud Computer (SCC), an experimental processor created by Intel Labs. Dedicated on-chip memories keep safe copies of key application data, allowing rollbacks upon error detection. The experimental results illustrate the tradeoff between application delay, consumed energy, and output fidelity as the injected errors are corrected. When output fidelity is treated as a hard constraint, the application slack consumed by mitigation can be reclaimed with dynamic frequency scaling. Output fidelity is guaranteed regardless of the error injection intensity, and the application's timing constraints are respected up to an upper bound on that intensity.
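
The checkpoint-and-rollback idea described above can be illustrated with a minimal sketch. This is not the authors' implementation. The `ProtectedStore` class is a hypothetical stand-in for a dedicated on-chip memory and is assumed to be error-free. Detection by CRC32 comparison is an assumption made for illustration, since the abstract does not say how errors are detected. The names `process_stream`, `running_sum`, and `flip_bit` are likewise hypothetical.

```python
import zlib


class ProtectedStore:
    """Hypothetical stand-in for a dedicated on-chip memory (assumed error-free)."""

    def __init__(self):
        self._copy = b""
        self._crc = zlib.crc32(self._copy)

    def save(self, data):
        # Keep a safe copy of the key data together with its checksum.
        self._copy = bytes(data)
        self._crc = zlib.crc32(self._copy)

    def matches(self, data):
        # Assumed detection mechanism: compare checksums of live and safe data.
        return zlib.crc32(bytes(data)) == self._crc

    def restore(self):
        # Roll back: hand out a fresh working copy of the safe data.
        return bytearray(self._copy)


def process_stream(blocks, step, inject=None):
    """Process blocks in order, rolling back the state when corruption is detected."""
    state = bytearray(8)          # key application data in the vulnerable cache
    store = ProtectedStore()
    store.save(state)             # initial checkpoint
    outputs, rollbacks = [], 0
    for index, block in enumerate(blocks):
        if inject is not None:
            inject(index, state)  # emulate a transient bit flip
        if not store.matches(state):
            state = store.restore()
            rollbacks += 1
        outputs.append(step(state, block))
        store.save(state)         # checkpoint at the block boundary
    return outputs, rollbacks


def running_sum(state, block):
    """Toy streaming kernel: the state holds a running total as 8 bytes."""
    total = int.from_bytes(state, "little") + sum(block)
    state[:] = total.to_bytes(8, "little")
    return total


def flip_bit(index, state):
    """Flip one bit of the state just before the third block is processed."""
    if index == 2:
        state[0] ^= 0x04


blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]
clean, _ = process_stream(blocks, running_sum)
faulty, rollbacks = process_stream(blocks, running_sum, inject=flip_bit)
```

In this toy run the faulty pass produces the same outputs as the clean pass, at the cost of one rollback. Every checkpoint and every rollback adds delay, which is the kind of overhead that the slack-reclamation step would have to absorb.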
