Efficient hardware checkpointing: concepts, overhead analysis, and implementation

Progress in reconfigurable hardware technology allows the implementation of complete SoCs in today's FPGAs. In the context of design for reliability, software checkpointing is an effective methodology for coping with faults. In this paper, we systematically extend the concept of checkpointing known from software systems to hardware tasks running on reconfigurable devices. We classify different mechanisms for hardware checkpointing and present formulas for estimating their hardware overhead. Moreover, we introduce a tool that takes over the burden of modifying hardware modules for checkpointing. Finally, we report post-synthesis results of applying our methodology to different hardware accelerators and compare them with the theoretical estimates.
