Fault-tolerant computing for radiation environments

Radiation, such as alpha particles and cosmic rays, can cause transient faults in electronic systems. Such faults cause errors called Single-Event Upsets (SEUs). SEUs are a major source of errors in electronics used in space applications, and they are a growing concern at ground level for deep submicron technologies. In this dissertation, we compared different approaches to providing fault tolerance against radiation effects and developed new techniques for fault tolerance and radiation characterization of systems. Estimating the SEU error rate of the individual units of a digital circuit is essential when designing a fault-tolerant system. We developed a new software method that uses weighted test programs and multiple linear regression for SEU characterization of digital circuits. We also showed how errors in bistables can be distinguished from errors in combinational logic by operating a sequential circuit at different clock frequencies. Radiation hardening is a fault avoidance technique applied to electronic components intended for space. However, radiation-hardened components are expensive and lag behind today's commercial components in performance. Commercial Off-The-Shelf (COTS) components have been suggested as an alternative that provides the higher computing power required for autonomous navigation and on-board data processing in space. We compared these two approaches in an actual space experiment: we collected errors from two processor boards, one radiation-hardened and one COTS, on board the ARGOS satellite. We designed and implemented software techniques for detecting, correcting, and recovering from errors, and demonstrated that they can enhance the reliability of COTS components without changing the hardware.
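As a rough illustration of what detecting and correcting errors in software can look like, the sketch below keeps three copies of a data word and repairs a single corrupted copy by bitwise majority vote. This is an assumed, generic redundancy scheme chosen for brevity, not necessarily the technique implemented in the dissertation; the names `majority` and `TriplicatedWord` are hypothetical.

```python
from typing import List


def majority(a: int, b: int, c: int) -> int:
    """Return the bitwise majority vote of three words."""
    return (a & b) | (a & c) | (b & c)


class TriplicatedWord:
    """A data word stored three times (assumed scheme, for illustration only)."""

    def __init__(self, value: int) -> None:
        self.copies: List[int] = [value, value, value]

    def read(self) -> int:
        # A bit flip in any single copy is outvoted by the other two.
        return majority(*self.copies)

    def scrub(self) -> bool:
        """Rewrite all copies with the voted value; report whether any copy differed."""
        voted = self.read()
        corrupted = any(copy != voted for copy in self.copies)
        self.copies = [voted, voted, voted]
        return corrupted
```

Calling `scrub()` periodically keeps isolated upsets from accumulating in more than one copy of the same word. The price is extra memory and extra instructions on every access, which is the kind of time overhead the abstract reports for the COTS board.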
Despite the 170% time overhead of the software techniques used on the COTS board, the throughput of the COTS board was an order of magnitude higher than that of the radiation-hardened board. The throughput of the radiation-hardened board would match that of the COTS board if the radiation-hardened board had cache memory. We also developed a new technique for tolerating permanent faults in cache memories. The main advantage of this technique is its low performance degradation even in the presence of a large number of faults.
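The SEU characterization method mentioned earlier combines weighted test programs with multiple linear regression. A minimal sketch of that idea follows, under an assumed linear model: each test program i exercises unit j with a known weight, and the error rate observed while running program i is approximately the weighted sum of the per-unit error rates. The model, the function names `solve_linear` and `estimate_unit_rates`, and the use of the normal equations are all assumptions made for illustration, not details taken from the dissertation.

```python
from typing import List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("weights do not separate the units (singular system)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def estimate_unit_rates(weights: List[List[float]],
                        observed: List[float]) -> List[float]:
    """Least-squares fit of observed[i] ~= sum_j weights[i][j] * rate[j]."""
    units = len(weights[0])
    # Normal equations: (W^T W) rate = W^T observed.
    wtw = [[sum(row[i] * row[j] for row in weights) for j in range(units)]
           for i in range(units)]
    wty = [sum(row[i] * y for row, y in zip(weights, observed))
           for i in range(units)]
    return solve_linear(wtw, wty)
```

To use it, pass one row of weights per test program and one observed error rate per program; the result is one estimated rate per unit. The fit only separates the units if the programs stress them in sufficiently different proportions, which is why the test programs are weighted differently. With at least as many programs as units and linearly independent weight rows, the system has a unique least-squares solution.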
