Scalable Approach to Failure Analysis of High-Performance Computing Systems

Failure analysis is necessary to clarify the root cause of a failure, predict when the next failure may occur, and improve the performance and reliability of a system. However, analyzing and interpreting failure data is not an easy task, especially for complex systems: the data are typically described by many attributes and are often inconsistent and ambiguous. In this paper, we present a scalable approach, based on rough sets theory (RST), for the analysis and interpretation of failure data from high-performance computing systems. Applying RST to a large, publicly available failure data set highlights the main attributes responsible for the root cause of a failure. The approach is also used to analyze other failure characteristics, such as time between failures, repair times, the workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
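
To make the RST machinery concrete, the sketch below shows how indiscernibility classes, lower approximations, the dependency degree gamma(C, D), and a naive reduct search could be computed over a toy failure table. This is a minimal illustration only: the attribute names, values, and records are hypothetical and do not reflect the actual failure-data schema or the reduct algorithm used in the paper. The deliberately inconsistent records (identical condition attributes, different failure category) show why gamma can be less than 1.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical failure records: condition attributes plus a decision attribute
# ("category"). Records 4 and 5 agree on all conditions but disagree on the
# decision, i.e. the table is intentionally inconsistent.
records = [
    {"node_type": "compute",  "workload": "graphics", "time_of_day": "night", "category": "hardware"},
    {"node_type": "compute",  "workload": "compute",  "time_of_day": "day",   "category": "software"},
    {"node_type": "frontend", "workload": "compute",  "time_of_day": "day",   "category": "software"},
    {"node_type": "compute",  "workload": "graphics", "time_of_day": "night", "category": "hardware"},
    {"node_type": "frontend", "workload": "graphics", "time_of_day": "day",   "category": "network"},
    {"node_type": "frontend", "workload": "graphics", "time_of_day": "day",   "category": "software"},
]

def partition(objs, attrs):
    """Group object indices into indiscernibility classes w.r.t. the given attributes."""
    classes = defaultdict(list)
    for i, rec in enumerate(objs):
        classes[tuple(rec[a] for a in attrs)].append(i)
    return list(classes.values())

def lower_approximation(objs, cond_attrs, target_set):
    """Union of indiscernibility classes fully contained in the target set."""
    lower = set()
    for cls in partition(objs, cond_attrs):
        if set(cls) <= target_set:
            lower.update(cls)
    return lower

def dependency_degree(objs, cond_attrs, decision_attr):
    """gamma(C, D): fraction of objects in the positive region of the decision."""
    positive = set()
    for dec_cls in partition(objs, [decision_attr]):
        positive |= lower_approximation(objs, cond_attrs, set(dec_cls))
    return len(positive) / len(objs)

cond = ["node_type", "workload", "time_of_day"]
full = dependency_degree(records, cond, "category")
print("gamma(all conditions, category) =", full)  # 0.667 due to the inconsistent pair

# Naive reduct search: smallest condition subsets preserving the full dependency degree.
for size in range(1, len(cond) + 1):
    reducts = [subset for subset in combinations(cond, size)
               if dependency_degree(records, list(subset), "category") == full]
    if reducts:
        print("candidate reducts:", reducts)
        break
```

On this toy table the search reports two-attribute reducts, i.e. subsets of condition attributes that explain the failure category as well as the full attribute set does; this is the kind of attribute reduction that points to the attributes most relevant to a failure's root cause.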
