HHC: Hierarchical hardware checkpointing to accelerate fault recovery for SRAM-based FPGAs

As feature sizes shrink to the nanometer scale, SRAM-based FPGAs become increasingly vulnerable to soft errors. Checkpointing is an effective fault recovery technique that restores a faulty system to its previous fault-free state. Because the system must suspend its function while a checkpoint is saved or restored, the Mean Time to Repair (MTTR) is critical to system performance. In this work, we propose a hierarchical hardware checkpointing (HHC) technique that combines a high-speed on-chip checkpoint with a low-speed off-chip checkpoint to accelerate fault recovery for SRAM-based FPGAs. Most single event effect (SEE) faults can be recovered from the high-speed on-chip checkpoint, which significantly reduces the MTTR of the system. The on-chip checkpoint occupies little memory because HHC stores only the logic states of user bits and check information for configuration bits. Experimental results show that, compared with traditional off-chip checkpoint strategies, the proposed technique reduces the MTTR of the system by 94.30%. The technique occupies 11.11% of the FPGA's memory resources, which is somewhat high but can be further optimized.
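
To illustrate why a fast on-chip level lowers the MTTR, the following is a minimal sketch of the expected repair time in a two-level checkpoint hierarchy. It is not the model used in this work: the function names, the parameter `p_on_chip` (fraction of faults recoverable on chip), and the assumption that a failed on-chip restore is followed by a full off-chip restore are all hypothetical, and the numbers in the usage lines are placeholders rather than measured values.

```python
def expected_mttr(p_on_chip: float, t_on_chip: float, t_off_chip: float) -> float:
    """Expected repair time for a two-level checkpoint hierarchy.

    Assumption (not from the paper): recovery always tries the on-chip
    checkpoint first; if that cannot repair the fault, the system falls
    back to the off-chip checkpoint and pays both restore times.
    """
    if not 0.0 <= p_on_chip <= 1.0:
        raise ValueError("p_on_chip must be a probability in [0, 1]")
    on_chip_case = p_on_chip * t_on_chip
    fallback_case = (1.0 - p_on_chip) * (t_on_chip + t_off_chip)
    return on_chip_case + fallback_case


def mttr_reduction(p_on_chip: float, t_on_chip: float, t_off_chip: float) -> float:
    """Relative MTTR reduction versus an off-chip-only baseline."""
    return 1.0 - expected_mttr(p_on_chip, t_on_chip, t_off_chip) / t_off_chip


# Placeholder values: 90% of faults handled on chip, on-chip restore
# 100x faster than off-chip restore (arbitrary time units).
hierarchical = expected_mttr(0.9, 1.0, 100.0)
reduction = mttr_reduction(0.9, 1.0, 100.0)
```

Under these assumptions the reduction grows with both the fraction of faults handled on chip and the speed gap between the two levels, which is the intuition behind pairing a small, fast checkpoint with a large, slow one.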
