Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer

Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousand microprocessors can participate in a single calculation that may take weeks or months to complete. In such a calculation, a fault in any one microprocessor could crash the computation or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons affect microprocessor-based computations. This paper presents results from an accelerated neutron-beam test of two microprocessors used in Roadrunner, the first petaflop supercomputer. The research questions include whether the application being run affects neutron susceptibility and whether different replicates of the hardware under test differ in their susceptibility to neutrons. Estimated failures in time (FIT) for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner as a whole.
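As a rough illustration of how a failures-in-time estimate (FIT, failures per 10^9 device-hours) is commonly derived from accelerated beam data, the sketch below uses the standard relations: cross-section = observed errors / neutron fluence, and FIT = cross-section × ambient flux × 10^9. This is not the paper's analysis (its statistical method is not described in the abstract). The function names, the error count, the fluence, the device count, and the ambient flux of 13 n/cm²/h (a commonly quoted sea-level reference value for neutrons above 10 MeV) are all assumptions chosen for illustration.

```python
HOURS_PER_FIT_UNIT = 1e9  # FIT is defined as failures per 10^9 device-hours


def cross_section(errors: int, fluence_n_per_cm2: float) -> float:
    """Per-device cross-section in cm^2: observed errors / beam fluence."""
    if fluence_n_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return errors / fluence_n_per_cm2


def fit_rate(sigma_cm2: float, flux_n_per_cm2_h: float) -> float:
    """FIT for one device exposed to a given ambient neutron flux."""
    return sigma_cm2 * flux_n_per_cm2_h * HOURS_PER_FIT_UNIT


def system_fit(device_fit: float, n_devices: int) -> float:
    """Scale to a system, assuming independent, identical devices."""
    return device_fit * n_devices


def mtbf_hours(fit: float) -> float:
    """Mean time between failures implied by a FIT value."""
    return HOURS_PER_FIT_UNIT / fit


if __name__ == "__main__":
    # Hypothetical beam-test outcome: 10 errors over a fluence of 1e10 n/cm^2.
    sigma = cross_section(errors=10, fluence_n_per_cm2=1e10)
    # Assumed ambient flux; a real site would need an altitude correction.
    per_device = fit_rate(sigma, flux_n_per_cm2_h=13.0)
    # Hypothetical system of 1000 identical devices.
    whole_system = system_fit(per_device, n_devices=1000)
    print(f"per-device FIT: {per_device:.2f}")
    print(f"system FIT: {whole_system:.1f}, MTBF: {mtbf_hours(whole_system):.0f} h")
```

The linear scaling in `system_fit` is the simplest possible assumption; it ignores differences between device types, workloads, and error-masking effects, which are among the questions the paper sets out to examine.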
