A framework for evaluating comprehensive fault resilience mechanisms in numerical programs

As HPC systems approach exascale, their circuit features will shrink while their overall size grows, all under a fixed power limit. These trends imply that soft faults in electronic circuits will become an increasingly significant problem for programs that run on these systems, causing them to occasionally crash or, worse, silently return incorrect results. This is motivating extensive work on program resilience to such faults, ranging from generic mechanisms such as replication or checkpoint/restart to algorithm-specific error detection and resilience mechanisms. Effective use of such mechanisms requires a detailed understanding of (1) which vulnerable parts of the program are most worth protecting and (2) the performance and resilience impact of fault resilience mechanisms on the program. This paper presents FaultTelescope, a tool that combines these two analyses and generates actionable insights by presenting program vulnerabilities and the impact of fault resilience mechanisms in an intuitive way.
