Architectures for online error detection and recovery in multicore processors

The huge investment in the design and production of multicore processors may be put at risk because emerging, highly miniaturized but unreliable fabrication technologies will impose significant barriers to the life-long reliable operation of future chips. Extremely complex, massively parallel multicore processor chips fabricated in these technologies will become more vulnerable to: (a) environmental disturbances that produce transient (or soft) errors, (b) latent manufacturing defects as well as aging/wearout phenomena that produce permanent (or hard) errors, and (c) verification inefficiencies that allow important design bugs to escape into the system. In an effort to cope with these reliability threats, several research teams have recently proposed multicore processor architectures that provide low-cost dependability guarantees against hardware errors and design bugs. This paper focuses on dependable multicore processor architectures that integrate solutions for online error detection, diagnosis, recovery, and repair during field operation. It discusses a taxonomy of representative approaches and presents a qualitative comparison based on hardware cost, performance overhead, types of faults detected, and detection latency. It also describes in more detail three recently proposed architectural approaches: a software-anomaly detection technique (SWAT), a dynamic verification technique (Argus), and a core salvaging methodology.
