Exploring the Effect of Compiler Optimizations on the Reliability of HPC Applications

The strict power-efficiency constraints required to achieve exascale systems will dramatically increase the number of detected and undetected transient errors in future high performance computing (HPC) systems. Among the various factors that affect system resiliency, the impact of compiler optimizations on the vulnerability of scientific applications executed on HPC systems has not been widely explored. In this work, we analyze whether and how the most common compiler optimizations impact the vulnerability of several mission-critical applications, what the trade-offs between performance and vulnerability are, and what causal relations exist between compiler optimization and application vulnerability. We show that highly optimized code is generally more vulnerable than unoptimized code. We also show that, while increasing the optimization level can drastically improve application performance, as expected, certain optimizations may provide only marginal benefits while considerably increasing application vulnerability.
