Saving Time in a Program Robustness Evaluation

The risk of having a program execution corrupted by transient faults is growing as computer processors are using more transistors, are becoming denser and are operating at lower voltages. This risk is multiplied when we take into account High Performance Computing with its hundreds or thousands of processors working together to solve a single problem. To evaluate how program executions behave in presence of transient faults we have proposed the concept of robustness against transient faults. This concept can be used to determine the more significant parts of a program with respect to the risk of misbehavior by transient faults for further study of improvement. The robustness concept can also be used as a metric to compare different approaches applied to a program to make it less likely of producing corrupted results. In this work we present why and how is possible to simplify a fraction of a program's robustness by taking into account the repetition of sequences of instructions. The simplified analysis obtains the exact same result as a full program robustness evaluation (exhaustively and without estimations). By simplifying the analysis we were able to reduce in up to 192 times our previously published robustness analysis time and also were able to evaluate larger programs in feasible time (unimaginable by using executions in a fault injection capable environment).

[1]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[2]  David I. August,et al.  Configurable Transient Fault Detection via Dynamic Binary Translation , 2006 .

[3]  Sarita V. Adve,et al.  Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[4]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[5]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[6]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[7]  Emilio Luque,et al.  A Methodology to Calculate a Program's Robustness against Transient Faults , 2007 .

[8]  Jing Yu,et al.  ESoftCheck: Removal of Non-vital Checks for Fault Tolerance , 2009, 2009 International Symposium on Code Generation and Optimization.

[9]  Michael Engel,et al.  Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment , 2012 .

[10]  Robert Baumann,et al.  Soft errors in advanced computer systems , 2005, IEEE Design & Test of Computers.

[11]  Sarita V. Adve,et al.  Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults , 2012, ASPLOS XVII.

[12]  David R. Kaeli,et al.  Quantifying software vulnerability , 2008, WREFT '08.