SoRProcessor Cache DevicesMemory Application Libraries Operating System ( a ) Hardware − centric ( b ) Software − centric

Transient faults are emerging as a critical concern in the reliability of microprocessors. While hardware reliability techniques are often employed for transient fault tolerance, software techniques represent a more cost-effective and flexible alternative. This paper proposes a software approach to transient fault tolerance which utilizes a run-time system to automatically apply process-level redundancy (PLR). PLR creates a set of redundant processes per application process and compares the processes during run time to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources (i.e. extra threads or cores). PLR is a software-centric approach to transient fault tolerance in which the focus is shifted from ensuring correct hardware execution, to ensuring correct software execution. The software-centric approach is able to ignore many benign faults which do not propagate to affect the program output. In addition, the dynamic deployment creates a very flexible fault tolerant system which transparently applies PLR without prior modifications to the application, shared libraries, or operating system. Experiments using a real PLR prototype on an SMP machine demonstrate that PLR can effectively provide transient fault tolerance with a slowdown of only 1.26x.

[1]  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, MICRO.

[2]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[3]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[4]  Ravishankar K. Iyer,et al.  Application-based metrics for strategic placement of detectors , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[5]  Irith Pomeranz,et al.  Transient-fault recovery using simultaneous multithreading , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[6]  Wolfgang Graetsch,et al.  Fault tolerance under UNIX , 1989, TOCS.

[7]  K. Sundaramoorthy,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, ASPLOS IX.

[8]  Robert W. Horst,et al.  Multiple instruction issue in the NonStop Cyclone processor , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[9]  Edward J. McCluskey,et al.  Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..

[10]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[11]  Thomas C. Bressoud,et al.  TFT: a software system for application-transparent fault tolerance , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[12]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[13]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[14]  Derek Bruening,et al.  An infrastructure for adaptive dynamic optimization , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[15]  Neeraj Suri,et al.  On the placement of software mechanisms for detection of data errors , 2002, Proceedings International Conference on Dependable Systems and Networks.

[16]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[17]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[18]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[19]  Y. C. Yeh,et al.  Triple-triple redundant 777 primary flight computer , 1996, 1996 IEEE Aerospace Applications Conference. Proceedings.

[20]  Sanjay J. Patel,et al.  Y-branches: when you come to a fork in the road, take it , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[21]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[22]  Robert W. Horst,et al.  Multiple instruction issue in the NonStop cyclone processor , 1990, ISCA '90.

[23]  Richard D. Schlichting,et al.  Fail-stop processors: an approach to designing fault-tolerant computing systems , 1981, TOCS.

[24]  J. Goldberg,et al.  SIFT: Design and analysis of a fault-tolerant computer for aircraft control , 1978, Proceedings of the IEEE.

[25]  Joel S. Emer,et al.  Reducing the soft-error rate of a high-performance microprocessor , 2004, IEEE Micro.