A software-implemented fault injection methodology for design and validation of system fault tolerance

Presents our experience in developing a methodology and tool at the Jet Propulsion Laboratory (JPL) for software-implemented fault injection (SWIFI) into a parallel-processing supercomputer which is being designed for use in next-generation space exploration missions. The fault injector uses software-based strategies to emulate the effects of radiation-induced transients occurring in the system hardware components. JPL's SWIFI tool set, which is called JIFI (JPL's Implementation of a Fault Injector), is being used in conjunction with an appropriate system fault model to evaluate candidate hardware and software fault tolerance architectures, to determine the sensitivity of applications to faults, and to measure the effectiveness of fault detection, isolation and recovery strategies. JIFI has been validated to inject faults into user-specified CPU registers and memory regions with a uniform random distribution in location and time. Together with verifiers, classifiers and run scripts, JIFI enables massive fault injection campaigns and statistical data analysis.

[1]  Ravishankar K. Iyer,et al.  NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors , 2000, Proceedings IEEE International Computer Performance and Dependability Symposium. IPDS 2000.

[2]  R. R. Some,et al.  REE: a COTS-based fault tolerant parallel processing supercomputer for spacecraft onboard scientific data analysis , 1999, Gateway to the New Millennium. 18th Digital Avionics Systems Conference. Proceedings (Cat. No.99CH37033).

[3]  Henrique Madeira,et al.  Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers , 1998, IEEE Trans. Software Eng..

[4]  Daniel S. Katz,et al.  Detailed radiation fault modeling of the Remote Exploration and Experimentation (REE) first generation testbed architecture , 2000, 2000 IEEE Aerospace Conference. Proceedings (Cat. No.00TH8484).