A Software-Based Redundant Execution Programming Model for Transient Fault Detection and Correction

Software reliability is becoming increasingly important as computer systems take on ever greater roles in everyday life. This paper proposes a software-based redundant execution programming model for detecting and correcting transient faults. The model uses multi-threading to execute redundant copies of a computation at the thread level and compares their results to detect faults; when the results disagree, majority voting selects the correct value so that the program can recover. A watchdog thread handles redundant threads that fail to respond. Preliminary experiments on benchmark programs show that the proposed model detects errors caused by transient faults and that the majority voting strategy correctly resumes program execution. Applying the proposed model should improve programs' fault tolerance.
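The mechanism described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `run_redundant` and `faulty_sum`, the choice of three replicas, and the timeout value are all assumptions made for illustration. The watchdog role is also simplified here to a bounded wait in the calling thread instead of a dedicated watchdog thread, and the fault is injected by hand by flipping one bit in one replica.

```python
import threading
import time
from collections import Counter


def run_redundant(task, args=(), replicas=3, timeout=1.0):
    """Run `task` in `replicas` threads and majority-vote on the results.

    Returns (value, fault_detected). Results must be hashable.
    `task` receives the replica index first (a hypothetical convention
    used here only so the example can inject a fault into one replica).
    """
    results = [None] * replicas
    finished = [False] * replicas

    def worker(i):
        results[i] = task(i, *args)
        finished[i] = True

    # Daemon threads, so a hung replica cannot keep the process alive.
    threads = [threading.Thread(target=worker, args=(i,), daemon=True)
               for i in range(replicas)]
    for t in threads:
        t.start()

    # Simplified watchdog: wait for all replicas up to a shared deadline,
    # then ignore any replica that has not responded.
    deadline = time.monotonic() + timeout
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    votes = Counter(results[i] for i in range(replicas) if finished[i])
    if not votes:
        raise RuntimeError("no replica responded before the deadline")

    value, count = votes.most_common(1)[0]
    if count <= replicas // 2:
        raise RuntimeError("no majority among replica results")

    # A fault is detected if any replica disagreed or failed to respond.
    return value, count < replicas


def faulty_sum(replica, data):
    """Sum `data`; replica 1 simulates a transient fault with a bit flip."""
    total = sum(data)
    if replica == 1:
        total ^= 1 << 3  # injected single-bit error
    return total


if __name__ == "__main__":
    value, fault_detected = run_redundant(faulty_sum, ([1, 2, 3],))
    print(value, fault_detected)
```

With three replicas, a single corrupted or unresponsive replica is outvoted by the other two, while two simultaneous disagreeing faults leave no majority and are reported as an error instead of being silently accepted.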
