Reconfigurable Fault Tolerance: A Comprehensive Framework for Reliable and Adaptive FPGA-Based Space Computing

Commercial SRAM-based field-programmable gate arrays (FPGAs) have the potential to provide space applications with the performance needed to meet next-generation mission requirements. However, mitigating an FPGA's susceptibility to radiation-induced single-event upsets (SEUs) is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. To reduce these overheads while still providing sufficient radiation mitigation, we propose a reconfigurable fault tolerance (RFT) framework that enables system designers to dynamically adjust a system's level of redundancy and fault mitigation based on the varying radiation incurred at different orbital positions. This framework includes an adaptive hardware architecture that leverages FPGA reconfiguration techniques so that significant processing can be performed efficiently and reliably when environmental factors permit. To accurately estimate upset rates, we propose an upset rate modeling tool that captures time-varying radiation effects for arbitrary satellite orbits using a collection of existing, publicly available tools and models. We perform fault-injection testing on a prototype RFT platform to validate the RFT architecture and RFT performability models. We combine our RFT hardware architecture and the modeled upset rates using phased-mission Markov modeling to estimate the performability gains achievable with our framework for two case-study orbits.
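
The phased-mission Markov idea mentioned above can be illustrated with a minimal sketch. It is not the paper's model: the three-state chain, the two-phase orbit, the upset rates, the scrub rate, and the relative throughput rewards are all hypothetical values chosen for illustration, and the assumption that reconfiguration at each phase boundary repairs a degraded module is ours. Each orbit phase gets its own upset rate, the state-probability vector is carried from one phase to the next, and work is accumulated only while the system has not failed.

```python
import math

def mat_mul(a, b):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(q, t, terms=20):
    """exp(q * t) via scaling-and-squaring around a truncated Taylor series."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * t
    squarings = math.ceil(math.log2(norm)) if norm > 1.0 else 0
    a = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# States: 0 = all modules good, 1 = one module upset (TMR only), 2 = failed.
def tmr_generator(lam, mu):
    """TMR with scrubbing: a second upset before the scrub completes is fatal."""
    return [[-3 * lam, 3 * lam, 0.0],
            [mu, -(mu + 2 * lam), 2 * lam],
            [0.0, 0.0, 0.0]]

def simplex_generator(lam, mu):
    """Three unprotected modules: any upset counts as failure (pessimistic)."""
    return [[-3 * lam, 0.0, 3 * lam],
            [0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0]]

# Hypothetical values: (generator, relative throughput) for each mode.
MODES = {"tmr": (tmr_generator, 1.0), "simplex": (simplex_generator, 3.0)}
# Hypothetical orbit: (phase duration in s, per-module upset rate in 1/s).
PHASES = [(3000.0, 1e-6), (600.0, 1e-3)]
SCRUB_RATE = 0.1  # hypothetical repair rate in 1/s

def evaluate(strategy, steps=200):
    """Return (mission reliability, expected work) for one mode per phase."""
    p = [1.0, 0.0, 0.0]
    work = 0.0
    for (duration, lam), mode in zip(PHASES, strategy):
        generator, reward = MODES[mode]
        # Assumption: reconfiguring at a phase boundary repairs a degraded module.
        p = [p[0] + p[1], 0.0, p[2]]
        dt = duration / steps
        step = mat_exp(generator(lam, SCRUB_RATE), dt)
        for _ in range(steps):
            alive_before = 1.0 - p[2]
            p = mat_mul([p], step)[0]
            # Trapezoidal estimate of the reward earned while not failed.
            work += reward * 0.5 * (alive_before + 1.0 - p[2]) * dt
    return 1.0 - p[2], work

STRATEGIES = {
    "tmr_only": ["tmr", "tmr"],
    "simplex_only": ["simplex", "simplex"],
    "adaptive": ["simplex", "tmr"],  # low redundancy only in the quiet phase
}
results = {name: evaluate(modes) for name, modes in STRATEGIES.items()}
for name, (reliability, work) in results.items():
    print(f"{name:13s} reliability={reliability:.4f} expected_work={work:.1f}")
```

With these made-up numbers, the adaptive schedule keeps most of the TMR-only reliability while accumulating considerably more work than TMR-only. That is the kind of trade-off a performability model is meant to expose, although real gains depend on the actual orbit, device, and upset rates.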
