Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs

This paper presents an in-depth characterization of the resiliency of more than 5 million HPC application runs completed during the first 518 production days of Blue Waters, a 13.1 petaflop Cray hybrid supercomputer. Unlike past work, we measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. The characterization is performed by means of a joint analysis of several data sources, which include workload and error/failure logs. In order to relate system errors and failures to the executed applications, we developed LogDiver, a tool to automate the data pre-processing and metric computation. Some of the lessons learned in this study include: i) while about 1.53% of applications fail due to system problems, the failed applications contribute to about 9% of the production node hours executed in the measured period, i.e., the system consumes computing resources, and system-related issues represent a potentially significant energy cost for the work lost, ii) there is a dramatic increase in the application failure probability when executing full-scale applications: 20x (from 0.008 to 0.162) when scaling XE applications from 10,000 to 22,000 nodes, and 6x (from 0.02 to 0.129) when scaling GPU/hybrid applications from 2000 to 4224 nodes, and iii) the resiliency of hybrid applications is impaired by the lack of adequate error detection capabilities in hybrid nodes.

[1]  Anand Sivasubramaniam,et al.  Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[2]  Celso L. Mendes,et al.  Deploying a Large Petascale System: The Blue Waters Experience , 2014, ICCS.

[3]  Franck Cappello,et al.  Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.

[4]  Franck Cappello,et al.  Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[5]  Woo-Sun Yang,et al.  Franklin Job Completion Analysis , 2010 .

[6]  Laxmikant V. Kalé,et al.  FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).

[7]  Franck Cappello,et al.  Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[8]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[9]  Ravishankar K. Iyer,et al.  Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[10]  Xin Chen,et al.  Predicting job completion times using system logs in supercomputing clusters , 2013, 2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W).

[11]  Domenico Cotroneo,et al.  Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[12]  Andrew L. Rukhin,et al.  Analysis of Time Series Structure SSA and Related Techniques , 2002, Technometrics.

[13]  Mark S. Squillante,et al.  Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[14]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..

[15]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[16]  Ravishankar K. Iyer,et al.  Characterization of operational failures from a business data processing SaaS platform , 2014, ICSE Companion.

[17]  Franck Cappello,et al.  Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[18]  Robert G. Edwards,et al.  The Chroma Software System for Lattice QCD , 2004 .

[19]  Catello Di Martino One Size Does Not Fit All: Clustering Supercomputer Failures Using a Multiple Time Window Approach , 2013, ISC.

[20]  Alan Norton,et al.  Petascale WRF simulation of hurricane sandy: Deployment of NCSA's cray XE6 blue waters , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[21]  Domenico Cotroneo,et al.  Assessing time coalescence techniques for the analysis of supercomputer logs , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).