HPC failure prediction proficiency metrics

Transient failures in large-scale HPC systems are increasing significantly due to the growing number of components. Fault tolerance mechanisms exist, but each invocation imposes additional overhead on the application. Failure prediction is therefore needed to mitigate such events gracefully and to minimize invocations of these mechanisms. However, the proficiency metrics used for HPC failure prediction are borrowed from related fields, mainly statistics, data mining, and information theory. Some of them fit well from certain perspectives, but none of them account for the computing time lost due to prediction errors. Thus, we present a study of the shortcomings of existing metrics and introduce additional metrics that capture the potential lost-computing-time perspective, to be used alongside existing metrics when assessing HPC failure prediction proficiency.
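To make the gap concrete, the sketch below contrasts the standard data-mining metrics (precision and recall over a confusion matrix) with a hypothetical lost-compute-time measure. This is an illustration only, not the paper's proposed metric: the function names, the cost model (false positives trigger an unneeded proactive action such as a checkpoint; false negatives lose the work since the last checkpoint), and all numeric values are our own assumptions.

```python
def precision_recall(tp, fp, fn):
    """Standard data-mining metrics for failure prediction."""
    precision = tp / (tp + fp)  # fraction of predicted failures that were real
    recall = tp / (tp + fn)     # fraction of real failures that were predicted
    return precision, recall


def lost_compute_time(fp, fn, proactive_cost, recompute_cost):
    """Hypothetical lost-time measure (illustrative, not from the paper):
    each false positive wastes one proactive action (e.g. a checkpoint);
    each false negative loses the work done since the last checkpoint."""
    return fp * proactive_cost + fn * recompute_cost


# Two predictors with identical precision/recall can still waste very
# different amounts of compute time if error costs are asymmetric.
p, r = precision_recall(tp=80, fp=20, fn=40)
lost = lost_compute_time(fp=20, fn=40, proactive_cost=5.0, recompute_cost=60.0)
print(p, r, lost)
```

Note the asymmetry: with these made-up costs, a missed failure (false negative) is far more expensive than a spurious alarm (false positive), which precision and recall alone cannot express.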
