Combining checkpointing and scrubbing in FPGA-based real-time systems

SRAM-based FPGAs provide an attractive solution for building high-performance embedded computing systems. Fault-tolerance mechanisms are usually implemented in FPGA-based critical systems to reduce their vulnerability to transient faults. Most fault-tolerant approaches proposed so far in the literature for FPGA systems use checkpointing for fault recovery and scrubbing for repair, and rely on redundancy-based solutions for fault detection. In this paper, we study the feasibility of building a low-cost fault-tolerant approach for FPGA-based real-time systems that combines checkpointing and scrubbing, using the latter for both fault detection and repair. We calculate the checkpoint frequencies that guarantee that tasks complete within their deadlines in the presence of transient faults, taking into account the scrubbing time of the FPGA processor. Furthermore, we propose a selective scrubbing approach that reduces the scrubbing time and makes the fault-tolerant execution of tasks with tight deadlines feasible. We demonstrate the proposed approach on a Leon-3-based SoC implemented in a Virtex-5 FPGA.
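To make the trade-off between checkpoint frequency and scrubbing time concrete, the following is a minimal sketch under a deliberately simplified single-task model; it is not the analysis used in the paper. It assumes that the task is split into n equal segments, that each segment ends with a scrub (which detects and repairs any fault) followed by a checkpoint save, and that each of at most k faults costs a rollback plus the re-execution of one full segment with its scrub and save. All function and parameter names are hypothetical.

```python
def worst_case_response(c, n, k, save, rollback, scrub):
    """Worst-case completion time of one task under the assumed model.

    c        -- fault-free execution time of the task
    n        -- number of equal segments (one checkpoint per segment)
    k        -- maximum number of transient faults to tolerate
    save     -- time to store a checkpoint
    rollback -- time to restore the last checkpoint
    scrub    -- time of one scrubbing pass (detection and repair)
    """
    segment = c / n
    per_checkpoint = scrub + save            # scrub first, then save state
    fault_free = c + n * per_checkpoint      # execution plus all checkpoints
    per_fault = rollback + segment + per_checkpoint  # redo one segment
    return fault_free + k * per_fault


def feasible_checkpoint_counts(c, deadline, k, save, rollback, scrub, max_n=1000):
    """Return every n in 1..max_n whose worst case still meets the deadline."""
    return [
        n
        for n in range(1, max_n + 1)
        if worst_case_response(c, n, k, save, rollback, scrub) <= deadline
    ]


if __name__ == "__main__":
    # Illustrative numbers only (arbitrary time units, not measured values).
    full = feasible_checkpoint_counts(c=100, deadline=160, k=1,
                                      save=1, rollback=1, scrub=4)
    selective = feasible_checkpoint_counts(c=100, deadline=160, k=1,
                                           save=1, rollback=1, scrub=1)
    print("full scrub, feasible n:", full)
    print("shorter scrub, feasible n:", selective)
```

Under this model, too few checkpoints make a single re-execution too expensive, while too many make the accumulated scrub and save overhead too expensive, so only a band of checkpoint counts meets the deadline. Shortening the scrub pass, which is the stated aim of selective scrubbing, widens that band.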
