Survey of fault tolerance techniques for shared memory multicore/multiprocessor systems

Modern nanoscale technology has made it possible to implement multiple processing cores on a single die. Shrinking transistor sizes, however, have made reliability a concern for such systems, because smaller transistors are more prone to both permanent and transient faults. Online fault tolerance techniques can be applied to reduce the probability of failure. These techniques must be efficient, since they execute concurrently with the applications running on the system. This paper discusses the challenges involved in online fault tolerance and the existing work that tackles them. We classify fault tolerance into four steps: proactive fault management, error detection, fault diagnosis, and recovery. For each step we discuss related work, with a focus on techniques for shared memory multicore/multiprocessor systems. We also highlight the additional difficulties of tolerating faults in parallel execution on such systems.
