A survey of checker architectures

Reliability is quickly becoming a primary design constraint for high-end processors because of the inherent limits of manufacturability, extreme miniaturization of transistors, and the growing complexity of large multicore chips. To achieve a high degree of fault tolerance, we need to detect faults quickly and try to rectify them. In this article, we focus on the former aspect. We present a survey of different kinds of fault detection mechanisms for processors at circuit, architecture, and software level. We collectively refer to such mechanisms as checker architectures. First, we propose a novel two-level taxonomy for different classes of checkers based on their structure and functionality. Subsequently, for each class we present the ideas in some of the seminal papers that have defined the direction of the area along with important extensions published in later work.

[1]  Waleed Dweik,et al.  WearMon: Reliability monitoring using adaptive critical path testing , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[2]  Eric Rotenberg,et al.  A study of slipstream processors , 2000, MICRO 33.

[3]  Suku Nair,et al.  Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor , 1990, IEEE Trans. Computers.

[4]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[5]  Valeria Bertacco,et al.  Application-Aware diagnosis of runtime hardware faults , 2010, 2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[6]  Marco Torchiano,et al.  Soft-error detection through software fault-tolerance techniques , 1999, Proceedings 1999 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (EFT'99).

[7]  Todd M. Austin,et al.  Efficient checker processor design , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[8]  Jacob A. Abraham,et al.  Evaluation of integrated system-level checks for on-line error detection , 1996, Proceedings of IEEE International Computer Performance and Dependability Symposium.

[9]  Josep Torrellas,et al.  ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.

[10]  Eric Rotenberg,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[11]  David I. August,et al.  Automatic Instruction-Level Software-Only Recovery , 2006, IEEE Micro.

[12]  Onur Mutlu,et al.  Online design bug detection: RTL analysis, flexible mechanisms, and evaluation , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[13]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[14]  Engin Ipek,et al.  Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[15]  Josep Torrellas,et al.  CADRE: Cycle-Accurate Deterministic Replay for Hardware Debugging , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[16]  Francine Bacchini,et al.  Verification: what works and what doesn't , 2004, DAC '04.

[17]  Sarita V. Adve,et al.  Shared Memory Consistency Models: A Tutorial , 1996, Computer.

[18]  Albert Meixner,et al.  Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[19]  Gurindar S. Sohi,et al.  Master/Slave Speculative Parallelization , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[20]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[21]  Sarita V. Adve,et al.  mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22]  Shuguang Feng,et al.  Self-calibrating Online Wearout Detection , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[23]  Irith Pomeranz,et al.  Transient-fault recovery using simultaneous multithreading , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[24]  Alan J. Hu,et al.  Improving multiple-CMP systems using token coherence , 2005, 11th International Symposium on High-Performance Computer Architecture.

[25]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[26]  Manoj Franklin,et al.  Hierarchical Verification for Increasing Performance in Reliable Processors , 2008, J. Electron. Test..

[27]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[28]  Gurindar S. Sohi,et al.  Master/slave speculative parallelization , 2002, MICRO.

[29]  Algirdas Avizienis,et al.  The N-Version Approach to Fault-Tolerant Software , 1985, IEEE Transactions on Software Engineering.

[30]  Anand Sivasubramaniam,et al.  A complexity-effective approach to ALU bandwidth enhancement for instruction-level temporal redundancy , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[31]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[32]  Israel Koren,et al.  Fault-Tolerant Systems , 2007 .

[33]  Josep Torrellas,et al.  Techniques to Mitigate the Effects of Congenital Faults in Processors , 2008 .

[34]  Ravishankar K. Iyer,et al.  An end-to-end approach for the automatic derivation of application-aware error detectors , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[35]  John P. Hayes,et al.  Low-cost on-line fault detection using control flow assertions , 2003, 9th IEEE On-Line Testing Symposium, 2003. IOLTS 2003..

[36]  Daniel Sánchez,et al.  Extending SRT for parallel applications in tiled-CMP architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[37]  T. N. Vijaykumar,et al.  Opportunistic transient-fault detection , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[38]  Michael C. Huang,et al.  Exploiting coarse-grain verification parallelism for power-efficient fault tolerance , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[39]  Arun Kumar Somani,et al.  Soft Error Mitigation Schemes for High Performance and Aggressive Designs , 2009 .

[40]  Cheng Wang,et al.  Software-based transparent and comprehensive control-flow error detection , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[41]  Albert Meixner,et al.  Dynamic Verification of Sequential Consistency , 2005, ISCA 2005.

[42]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[43]  K. Sundaramoorthy,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[44]  Necromancer: enhancing system throughput by animating dead cores , 2010, ISCA '10.

[45]  Shuai Wang,et al.  Resource-Driven Optimizations for Transient-Fault Detecting SuperScalar Microarchitectures , 2005, Asia-Pacific Computer Systems Architecture Conference.

[46]  Albert Meixner,et al.  Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures , 2009, IEEE Transactions on Dependable and Secure Computing.

[47]  Edward J. McCluskey,et al.  The Watchdog Task: Concurrent error detection using assertions , 1985 .

[48]  Shekhar Y. Borkar,et al.  Microarchitecture and Design Challenges for Gigascale Integration , 2004, MICRO.

[49]  David García,et al.  NonStop/spl reg/ advanced architecture , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[50]  M. Blum,et al.  Reflections on the Pentium Division Bug , 1995 .

[51]  Mikko H. Lipasti,et al.  Dynamic Verification of Cache Coherence Protocols , 2004 .

[52]  José Duato,et al.  A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[53]  Daniel J. Sorin,et al.  Specifying and dynamically verifying address translation-aware memory consistency , 2010, ASPLOS XV.

[54]  John Paul Shen,et al.  Exploiting Instruction-Level Parallelism for Integrated Control-Flow Monitoring , 1994, IEEE Trans. Computers.

[55]  James E. Smith,et al.  The microarchitecture of superscalar processors , 1995, Proc. IEEE.

[56]  Edward J. McCluskey,et al.  Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..

[57]  Albert Meixner,et al.  Dynamic verification of sequential consistency , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[58]  Mihalis Psarakis,et al.  Accelerating microprocessor silicon validation by exposing ISA diversity , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[59]  Milo M. K. Martin,et al.  SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[60]  Jacob A. Abraham,et al.  CEDA: control-flow error detection through assertions , 2006, 12th IEEE International On-Line Testing Symposium (IOLTS'06).

[61]  Onur Mutlu,et al.  Microarchitecture-based introspection: a technique for transient-fault tolerance in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[62]  Arun K. Somani,et al.  REESE: a method of soft error detection in microprocessors , 2001, 2001 International Conference on Dependable Systems and Networks.

[63]  Yi Ma,et al.  Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery , 2007 .

[64]  A. Singh,et al.  Fault-tolerant systems , 1990, Computer.

[65]  Aneesh Aggarwal,et al.  Speculative instruction validation for performance-reliability trade-off , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[66]  Massimo Violante,et al.  Soft-error detection using control flow assertions , 2003, Proceedings 18th IEEE Symposium on Defect and Fault Tolerance in VLSI Systems.

[67]  Josep Torrellas Architectures for Extreme-Scale Computing , 2009, Computer.

[68]  Karthikeyan Sankaralingam,et al.  Relax: an architectural framework for software recovery of hardware faults , 2010, ISCA.

[69]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[70]  Ravishankar K. Iyer,et al.  Automated Derivation of Application-aware Error Detectors using Static Analysis , 2007, 13th IEEE International On-Line Testing Symposium (IOLTS 2007).

[71]  Daniel J. Sorin,et al.  Specifying and dynamically verifying address translation-aware memory consistency , 2010, ASPLOS 2010.

[72]  Jaume Abella,et al.  Implementing End-to-End Register Data-Flow Continuous Self-Test , 2009, IEEE Transactions on Computers.

[73]  Sridhar Narayanan,et al.  TSOtool: a program for verifying memory systems using the memory consistency model , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[74]  José M. García,et al.  REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs , 2009, Euro-Par.

[75]  David Blaauw,et al.  Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation , 2003, MICRO.

[76]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[77]  T. N. Vijaykumar,et al.  Opportunistic Transient-Fault Detection , 2005, ISCA 2005.

[78]  Mark D. Hill,et al.  Lamport clocks: verifying a directory cache-coherence protocol , 1998, SPAA '98.

[79]  James L. Walsh,et al.  IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..

[80]  Babak Falsafi,et al.  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, MICRO.

[81]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[82]  Robert W. Horst,et al.  The Hardware Architecture and Linear Expansion of Tandem NonStop Systems , 2002 .

[83]  Arshad Jhumka,et al.  A methodology for the generation of efficient error detection mechanisms , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[84]  Josep Torrellas,et al.  Facelift: Hiding and slowing down aging in multicores , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[85]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[86]  Michael C. Huang,et al.  A performance-correctness explicitly-decoupled architecture , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[87]  Todd M. Austin,et al.  Ultra low-cost defect protection for microprocessor pipelines , 2006, ASPLOS XII.

[88]  Albert Meixner,et al.  Error Detection Using Dynamic Dataflow Verification , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[89]  Yervant Zorian,et al.  Principles of testing electronic systems , 2000 .

[90]  Yi Ma,et al.  Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery , 2007, IEEE Transactions on Parallel and Distributed Systems.

[91]  James C. Hoe,et al.  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[92]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[93]  Jim Gray,et al.  Fault Tolerance in Tandem Computer Systems , 1987 .

[94]  共立出版株式会社 コンピュータ・サイエンス : ACM computing surveys , 1978 .

[95]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[96]  Rana Ejaz Ahmed,et al.  Cache-aided rollback error recovery (CARER) algorithm for shared-memory multiprocessor systems , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[97]  Suku Nair,et al.  Design and Evaluation of System-Level Checks for On-Line Control Flow Error Detection , 1999, IEEE Trans. Parallel Distributed Syst..

[98]  Satish Narayanasamy,et al.  Patching Processor Design Errors with Programmable Hardware , 2007, IEEE Micro.

[99]  Sharad Malik,et al.  Runtime validation of memory ordering using constraint graph checking , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[100]  Albert Meixner,et al.  Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[101]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[102]  Tianshi Chen,et al.  Fast complete memory consistency verification , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[103]  David Blaauw,et al.  Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[104]  Babak Falsafi,et al.  Fingerprinting: bounding soft-error-detection latency and bandwidth , 2004, IEEE Micro.

[105]  Onur Mutlu,et al.  Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[106]  Michael C. Huang,et al.  Supporting highly-decoupled thread-level redundancy for parallel programs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[107]  Joel F. Bartlett,et al.  A NonStop kernel , 1981, SOSP.

[108]  Karthikeyan Sankaralingam,et al.  Sampling + DMR: Practical and low-overhead permanent fault detection , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[109]  Kewal K. Saluja,et al.  Energy-efficient fault tolerance in chip multiprocessors using Critical Value Forwarding , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).