High-level vulnerability over space and time to insidious soft errors

The integrity of computational results is being increasingly threatened by soft errors, especially for computations that are large-scale or performed under harsh conditions. Existing methods for soft error estimation do not clearly characterize the vulnerability associated with a particular result. 1) We propose a metric which captures the intrinsic vulnerability over space and time (VST) to soft errors that corrupt computational results. The method of VST estimation bridges the gap between the inherently low-level faults and the high-level computational failures that they eventually cause. 2) We define a model of an insidious soft error and try to clear up confusion around the concept of silent data corruption. 3) We present experimental results from three vulnerability studies involving floating-point addition, CORDIC, and FFT computations. The results show that traditional vulnerability metrics can be confounded by seemingly reliable but inefficient implementations which actually incur high vulnerability per computation. The VST method characterizes vulnerability accurately, provides a figure-of-merit for comparing alternative implementations of an algorithm, and in some cases uncovers pronounced and unexpected fluctuations in vulnerability.

[1]  John F. Meyer Computation-Based Reliability Analysis , 1976, IEEE Transactions on Computers.

[2]  Tao Li,et al.  Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior , 2006, 14th IEEE International Symposium on Modeling, Analysis, and Simulation.

[3]  Shubu Mukherjee,et al.  Architecture Design for Soft Errors , 2008 .

[4]  Rami Melhem,et al.  The effects of energy management on reliability in real-time embedded systems , 2004, ICCAD 2004.

[5]  Petru Eles,et al.  Scheduling and voltage scaling for energy/reliability trade-offs in fault-tolerant time-triggered embedded systems , 2007, 2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[6]  Xiaodong Li,et al.  Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[7]  P. L. Springer Assessing application vulnerability to radiation-induced SEUs in memory , 2001 .

[8]  David I. August,et al.  Design and evaluation of hybrid fault-detection systems , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[9]  N. Seifert,et al.  Timing vulnerability factors of sequentials , 2004, IEEE Transactions on Device and Materials Reliability.

[10]  Arijit Biswas,et al.  Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal , 2008, IEEE Computer Architecture Letters.

[11]  Bronis R. de Supinski,et al.  Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.

[12]  A.D. George,et al.  Hardware/software interface for high-performance space computing with FPGA coprocessors , 2006, 2006 IEEE Aerospace Conference.

[13]  Sanjay J. Patel,et al.  Examining ACE analysis reliability estimates using fault-injection , 2007, ISCA '07.

[14]  Brad Calder,et al.  Discovering and Exploiting Program Phases , 2003, IEEE Micro.

[15]  Charng-Da Lu,et al.  Assessing Fault Sensitivity in MPI Applications , 2004, Proceedings of the ACM/IEEE SC2004 Conference.