Terrestrial-based radiation upsets: a cautionary tale

Upsets caused by terrestrial neutron radiation from cosmic rays have become more commonplace. Although the upset rate from terrestrial neutrons is lower than that from space-based radiation, physics, system design, and system location have combined to make systems increasingly vulnerable to terrestrial radiation. FPGA-based systems are particularly susceptible, because the FPGAs, microprocessors, and memory are all sensitive to upsets. We are interested in reconfigurable supercomputers, which must be highly reliable and highly available despite being very sensitive to radiation. In this paper, we estimate the error rates for FPGAs, memory, and microprocessors so that the sensitivity of the Cray XD1 reconfigurable supercomputer can be predicted. We also present possible mitigation methods appropriate for neutron-induced upset rates.
