One Size Does Not Fit All: Clustering Supercomputer Failures Using a Multiple Time Window Approach

This paper proposes a heuristic to improve the analysis of supercomputer error logs. The heuristic estimates the measurement error introduced when error events are clustered, and uses that estimate to drive the analysis. The goal is both to reduce clustering-induced errors and to quantify how much they affect the measurements. The heuristic is validated against 40 synthetic datasets covering systems from 16k to 256k nodes under different failure assumptions. We show that i) accurately analyzing the complex failure behavior of large computing systems requires multiple time windows, adopted at the granularity of node subsystems (e.g., memory and I/O), and ii) for large systems, the classical single time window analysis can overestimate the MTBF by more than 150%, while the proposed heuristic can decrease the measurement error by one order of magnitude.
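To illustrate why a single time window can inflate the MTBF, the following minimal sketch applies plain time-window clustering to a toy log. It is not the heuristic proposed in the paper: the function names (`coalesce`, `count_single_window`, `count_per_subsystem`), the window values, and the event data are all assumptions made up for illustration. It only shows how grouping every event under one window can merge unrelated failures, while per-subsystem windows keep them apart.

```python
from collections import defaultdict

def coalesce(timestamps, window):
    """Group timestamps into clusters: an event joins the current cluster
    if it occurs within `window` seconds of the previous event."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= window:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

def count_single_window(events, window):
    """Count failures using one window for all events (classical approach)."""
    return len(coalesce([t for t, _ in events], window))

def count_per_subsystem(events, windows):
    """Count failures using a separate window for each subsystem."""
    by_subsystem = defaultdict(list)
    for t, subsystem in events:
        by_subsystem[subsystem].append(t)
    return sum(len(coalesce(ts, windows[s])) for s, ts in by_subsystem.items())

# Hypothetical log: (timestamp in seconds, subsystem). Values are invented.
events = [(0, "memory"), (5, "memory"), (100, "io"),
          (250, "io"), (260, "memory"), (900, "io")]
observation_time = 1000  # seconds, assumed

single_count = count_single_window(events, window=300)
multi_count = count_per_subsystem(events, {"memory": 30, "io": 300})

# MTBF estimated as observation time divided by the number of clusters.
mtbf_single = observation_time / single_count
mtbf_multi = observation_time / multi_count
print(mtbf_single, mtbf_multi)
```

In this toy log the single 300-second window merges memory and I/O events into fewer clusters, so the estimated MTBF comes out higher than with per-subsystem windows. The real magnitude of the effect depends on the system and on how the windows are chosen, which is what the paper's heuristic addresses.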
