Detailed radiation fault modeling of the Remote Exploration and Experimentation (REE) first generation testbed architecture

The goal of the NASA HPCC Remote Exploration and Experimentation (REE) Project is to transfer commercial supercomputing technology into space. The project will use state-of-the-art, low-power, non-radiation-hardened COTS hardware chips and COTS software to the maximum extent possible, and will rely on software-implemented fault tolerance to provide the required levels of availability and reliability. We outline the methodology used to develop a detailed radiation fault model for the REE Testbed architecture. The model addresses the effects of energetic protons and heavy ions, which cause single-event upsets and single-event multiple upsets in digital logic devices and are expected to be the primary fault generation mechanism. Unlike previous modeling efforts, this model addresses fault rates and types in computer subsystems at a level of granularity fine enough (i.e., the register level) that specific software and operational errors can be derived. We present the current state of the model, the model verification activities and results to date, and plans for future work. Finally, we explain the methodology by which this model will be used to derive application-level error effects sets. These error effects sets will be used together with our Testbed fault injection capabilities and our applications' mission scenarios to replicate the predicted fault environment on our suite of onboard applications.
