Automatic instruction-level recovery by duplicated instructions and checkpointing

This paper proposes a software-based technique for error detection and recovery at the instruction level. The technique combines instruction duplication with checkpointing. As in previous work, all instructions are duplicated and appropriate "check" instructions are inserted to detect errors. Once an error is detected, checkpointing is used to restore the program to a correct state. Two optimizations, checksums and live-variable analysis, are introduced to reduce the performance overhead. Experimental results show that most data errors can be recovered with relatively low performance overhead.
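The detect-then-roll-back idea described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the real technique operates on compiled instructions, whereas here a running sum stands in for the program, `checked_sum` is a hypothetical name, `fault_at` is a hypothetical fault-injection hook, and the checkpoint is taken before every step. The checksum and live-variable optimizations are not modeled.

```python
def checked_sum(values, fault_at=None):
    """Sum `values` using a duplicated accumulator and per-step checkpoints.

    fault_at (hypothetical hook): index of the step whose original copy
    suffers a single transient bit flip. Returns (result, recoveries).
    """
    acc, acc_dup = 0, 0              # original and duplicated "registers"
    recoveries = 0
    pending_fault = fault_at
    i = 0
    while i < len(values):
        checkpoint = (acc, acc_dup)  # save state before the step
        acc = acc + values[i]        # original instruction
        acc_dup = acc_dup + values[i]  # duplicated instruction
        if pending_fault == i:
            acc ^= 1 << 3            # inject one bit flip into the original copy
            pending_fault = None     # transient fault: it does not recur
        if acc != acc_dup:           # inserted "check" instruction
            acc, acc_dup = checkpoint  # roll back to the saved state
            recoveries += 1
            continue                 # re-execute the same step
        i += 1
    return acc, recoveries
```

Under these assumptions, `checked_sum([1, 2, 3, 4])` returns `(10, 0)`, and `checked_sum([1, 2, 3, 4], fault_at=2)` returns `(10, 1)`: the mismatch is detected at the third step, the state is rolled back, and the step is re-executed without the fault.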
