mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems

Continued technology scaling is resulting in systems with billions of devices. Unfortunately, these devices are prone to failures from various sources, resulting in even commodity systems being affected by the growing reliability threat. Thus, traditional solutions involving high redundancy or piecemeal solutions targeting specific failure modes will no longer be viable owing to their high overheads. Recent reliability solutions have explored using low-cost monitors that watch for anomalous software behavior as a symptom of hardware faults. We previously proposed the SWAT system that uses such low-cost detectors to detect hardware faults, and a higher cost mechanism for diagnosis. However, all of the prior work in this context, including SWAT, assumes single-threaded applications and has not been demonstrated for multithreaded applications running on multicore systems. This paper presents mSWAT, the first work to apply symptom based detection and diagnosis for faults in multicore architectures running multithreaded software. For detection, we extend the symptom-based detectors in SWAT and show that they result in a very low Silent Data Corruption (SDC) rate for both permanent and transient hardware faults. For diagnosis, the multicore environment poses significant new challenges. First, deterministic replay required for SWAT's single-threaded diagnosis incurs higher overheads for multithreaded workloads. Second, the fault may propagate to fault-free cores resulting in symptoms from fault-free cores and no available known-good core, breaking fundamental assumptions of SWAT's diagnosis algorithm. We propose a novel permanent fault diagnosis algorithm for multithreaded applications running on multicore systems that uses a lightweight isolated deterministic replay to diagnose the faulty core with no prior knowledge of a known good core. Our results show that this technique successfully diagnoses over 95% of the detected permanent faults while incurring low hardware overheads. mSWAT thus offers an affordable solution to protect future multicore systems from hardware faults.

[1]  David García,et al.  NonStop/spl reg/ advanced architecture , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[2]  Derek Hower,et al.  Rerun: Exploiting Episodes for Lightweight Memory Race Recording , 2008, 2008 International Symposium on Computer Architecture.

[3]  Shubhendu S. Mukherjee,et al.  Perturbation-based Fault Screening , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[4]  Sarita V. Adve,et al.  Accurate microarchitecture-level fault modeling for studying hardware faults , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[5]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, 2008 International Symposium on Computer Architecture.

[6]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[7]  Min Xu,et al.  A "flight data recorder" for enabling full-system multiprocessor deterministic replay , 2003, ISCA '03.

[8]  Shekhar Y. Borkar,et al.  Design challenges of technology scaling , 1999, IEEE Micro.

[9]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[10]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, International Symposium on Computer Architecture.

[11]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[12]  Sule Ozev,et al.  Online diagnosis of hard faults in microprocessors , 2007, TACO.

[13]  Milo M. K. Martin,et al.  SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[14]  Pradip Bose,et al.  Exploiting structural duplication for lifetime reliability enhancement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[15]  Josep Torrellas,et al.  ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.

[16]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[17]  Eric Rotenberg,et al.  Coverage of a microarchitecture-level fault check regimen in a superscalar processor , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[18]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[19]  Sarita V. Adve,et al.  Trace-based microarchitecture-level diagnosis of permanent hardware faults , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[20]  John P. Hayes,et al.  Low-cost on-line fault detection using control flow assertions , 2003, 9th IEEE On-Line Testing Symposium, 2003. IOLTS 2003..

[21]  Todd M. Austin,et al.  Ultra low-cost defect protection for microprocessor pipelines , 2006, ASPLOS XII.

[22]  David A. Wood,et al.  Full-system timing-first simulation , 2002, SIGMETRICS '02.

[23]  Massimo Violante,et al.  Soft-error detection using control flow assertions , 2003, Proceedings 18th IEEE Symposium on Defect and Fault Tolerance in VLSI Systems.

[24]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[25]  Todd M. Austin,et al.  CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework , 2008, 2008 IEEE International Conference on Computer Design.

[26]  Ravishankar K. Iyer,et al.  An Architectural Framework for Detecting Process Hangs/Crashes , 2005, EDCC.

[27]  Eric Rotenberg,et al.  Assertion-Based Microarchitecture Design for Improved Fault Tolerance , 2006, 2006 International Conference on Computer Design.

[28]  Ravishankar K. Iyer,et al.  Dynamic Derivation of Application-Specific Error Detectors and their Implementation in Hardware , 2006, 2006 Sixth European Dependable Computing Conference.

[29]  Donald Yeung,et al.  Application-Level Correctness and its Impact on Fault Tolerance , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[30]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[31]  Wi N Dows FLIGHT DATA RECORDER FOR , 2007 .

[32]  Amin Ansari,et al.  The StageNet fabric for constructing resilient multicore systems , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[33]  Satish Narayanasamy,et al.  BugNet: continuously recording program execution for deterministic replay debugging , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[34]  Huiyang Zhou,et al.  Unified Architectural Support for Soft-Error Protection or Software Bug Detection , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[35]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[36]  Josep Torrellas,et al.  ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[37]  Ravishankar K. Iyer,et al.  An end-to-end approach for the automatic derivation of application-aware error detectors , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.