Reli: Hardware/software Checkpoint and Recovery scheme for embedded processors

Checkpoint and Recovery (CR) allows computer systems to operate correctly even when compromised by transient faults. Many software and hardware CR systems exist, but they are usually too large, too slow, or require major modifications to the software or extensive modifications to the caching scheme. In this paper, we propose a novel error-recovery management scheme based on re-engineering the instruction set. We take the native instruction set of the processor and augment its microinstructions with additional micro-operations that enable checkpointing. Recovery is implemented by three custom instructions, which restore the registers, the data memory values, and the special registers (PC, status registers, etc.) that were changed. The checkpoint storage is adjusted according to the benchmark executed. Results show that our method degrades performance by just 1.45% under fault-free conditions and incurs an area overhead of 45% on average and 79% in the worst case. Recovery takes just 62 clock cycles in the worst case for the examples we examined.
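The general idea of checkpointing on every write and recovering in three steps can be illustrated with a small software model. This is only a sketch under stated assumptions, not the processor design described in the paper: the class `ToyCore`, its methods, the register and memory sizes, and the undo-log layout are all hypothetical names chosen for illustration. Each write first saves the old value (standing in for the extra checkpointing micro-operations), and three separate routines restore the changed registers, the changed data memory values, and the special registers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCore:
    """Hypothetical toy model; not the hardware described in the paper."""
    regs: list = field(default_factory=lambda: [0] * 8)
    mem: list = field(default_factory=lambda: [0] * 16)
    pc: int = 0
    status: int = 0
    # Checkpoint storage: old values, saved on the first write after a checkpoint.
    reg_log: dict = field(default_factory=dict)
    mem_log: dict = field(default_factory=dict)
    special_log: dict = field(default_factory=lambda: {"pc": 0, "status": 0})

    def checkpoint(self):
        """Start a new checkpoint interval: drop old logs, save PC and status."""
        self.reg_log.clear()
        self.mem_log.clear()
        self.special_log = {"pc": self.pc, "status": self.status}

    def write_reg(self, r, value):
        # Assumed "checkpointing micro-operation": keep the pre-checkpoint value.
        self.reg_log.setdefault(r, self.regs[r])
        self.regs[r] = value
        self.pc += 1

    def write_mem(self, addr, value):
        self.mem_log.setdefault(addr, self.mem[addr])
        self.mem[addr] = value
        self.pc += 1

    # Three recovery steps, mirroring the three custom instructions in spirit.
    def recover_regs(self):
        for r, old in self.reg_log.items():
            self.regs[r] = old
        self.reg_log.clear()

    def recover_mem(self):
        for addr, old in self.mem_log.items():
            self.mem[addr] = old
        self.mem_log.clear()

    def recover_special(self):
        self.pc = self.special_log["pc"]
        self.status = self.special_log["status"]


def demo():
    core = ToyCore()
    core.write_reg(1, 5)
    core.write_mem(3, 7)
    core.checkpoint()          # state to roll back to
    core.write_reg(1, 99)      # work that a transient fault invalidates
    core.write_reg(1, 100)
    core.write_mem(3, 42)
    core.status = 1            # pretend a fault flag was raised
    core.recover_regs()
    core.recover_mem()
    core.recover_special()
    return core.regs[1], core.mem[3], core.pc, core.status
```

Logging only the first overwritten value per location keeps the checkpoint storage proportional to the number of distinct locations changed since the last checkpoint, which is one plausible reason such storage could be adjusted per benchmark; the paper's actual storage organization is not given in the abstract.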
