PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.

[1]  James L. Walsh,et al.  IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..

[2]  Paul Vickers,et al.  Somersault Software Fault-Tolerance , 1998 .

[3]  Martin Hiller,et al.  Executable assertions for detecting data errors in embedded control systems , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[4]  Kim M. Hazelwood,et al.  SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[5]  Hiroyuki Sugiyama,et al.  A 1.3 GHz fifth generation SPARC64 microprocessor , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC..

[6]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[7]  Johan Karlsson,et al.  Experimental evaluation of time-redundant execution for a brake-by-wire application , 2002, Proceedings International Conference on Dependable Systems and Networks.

[8]  Sanjay J. Patel,et al.  Y-branches: when you come to a fork in the road, take it , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[9]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[10]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[11]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[12]  Neeraj Suri,et al.  On the placement of software mechanisms for detection of data errors , 2002, Proceedings International Conference on Dependable Systems and Networks.

[13]  Y. C. Yeh,et al.  Triple-triple redundant 777 primary flight computer , 1996, 1996 IEEE Aerospace Applications Conference. Proceedings.

[14]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[15]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[16]  James E. Smith,et al.  Virtual machines - versatile platforms for systems and processes , 2005 .

[17]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[18]  Cheng Wang,et al.  Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[19]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[20]  Algirdas Avizienis,et al.  The N-Version Approach to Fault-Tolerant Software , 1985, IEEE Transactions on Software Engineering.

[21]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[22]  Dirk Grunwald,et al.  Shadow Profiling: Hiding Instrumentation Costs with Parallelism , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[23]  Anthony Skjellum,et al.  MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware , 2004, Cluster Computing.

[24]  Ravishankar K. Iyer,et al.  Application-based metrics for strategic placement of detectors , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[25]  N. Hengartner,et al.  Predicting the number of fatal soft errors in Los Alamos national laboratory's ASC Q supercomputer , 2005, IEEE Transactions on Device and Materials Reliability.

[26]  Irith Pomeranz,et al.  Transient-fault recovery using simultaneous multithreading , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[27]  Wolfgang Graetsch,et al.  Fault tolerance under UNIX , 1989, TOCS.

[28]  Ravishankar K. Iyer,et al.  Chameleon: A Software Infrastructure for Adaptive Fault Tolerance , 1999, IEEE Trans. Parallel Distributed Syst..

[29]  Thomas C. Bressoud,et al.  TFT: a software system for application-transparent fault tolerance , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[30]  Fred B. Schneider,et al.  Hypervisor-based fault tolerance , 1996, TOCS.

[31]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[32]  Tipp Moseley,et al.  Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[33]  J. Goldberg,et al.  SIFT: Design and analysis of a fault-tolerant computer for aircraft control , 1978, Proceedings of the IEEE.

[34]  Derek Bruening,et al.  Maintaining consistency and bounding capacity of software code caches , 2005, International Symposium on Code Generation and Optimization.

[35]  Eric Rotenberg,et al.  A study of slipstream processors , 2000, MICRO 33.

[36]  Richard D. Schlichting,et al.  Fail-stop processors: an approach to designing fault-tolerant computing systems , 1983, TOCS.

[37]  T. N. Vijaykumar,et al.  Opportunistic transient-fault detection , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[38]  K. Soumyanath,et al.  Scaling trends of cosmic ray induced soft errors in static latches beyond 0.18 /spl mu/ , 2001, 2001 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.01CH37185).

[39]  B. Zorn,et al.  Exterminator : Automatically Correcting Memory Errors , 2006 .

[40]  K. Sundaramoorthy,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[41]  Emery D. Berger,et al.  Exterminator: automatically correcting memory errors with high probability , 2007, PLDI '07.

[42]  Robert W. Horst,et al.  Multiple instruction issue in the NonStop cyclone processor , 1990, ISCA '90.

[43]  Edward J. McCluskey,et al.  Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..

[44]  Emery D. Berger,et al.  DieHard: probabilistic memory safety for unsafe languages , 2006, PLDI '06.

[45]  Zizhong Chen,et al.  Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing , 2005, Int. J. High Perform. Comput. Appl..