Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

The ability to predict impending failures (hardware or software) on large scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms could drastically increase the scalability of applications and efficiency of platforms. In this paper we present our findings and methodologies employed to date in our search for reliable, advance indicators of failures on a 288 node, 4608 core, Opteron based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that for a particular error case, statistical analysis using OVIS would enable advanced warning of cluster problems on timescales that would enable application and system administrator response in advance of errors, subsequent system error log reporting, and job failures. This is significant as the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are.

[1]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[2]  S. Scott,et al.  Reliability Analysis in HPC clusters , 2006 .

[3]  Cheng-Zhong Xu,et al.  Exploring event correlation for failure prediction in coalitions of clusters , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[4]  Ann C. Gentile,et al.  Meaningful Automated Statistical Analysis of Large Computational Clusters , 2005, 2005 IEEE International Conference on Cluster Computing.

[5]  Anand Sivasubramaniam,et al.  Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[6]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..

[7]  Bianca Schroeder,et al.  Understanding failures in petascale computers , 2007 .

[8]  Jon Stearley,et al.  Bad Words: Finding Faults in Spirit's Syslogs , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[9]  Bert J. Debusschere,et al.  Ovis-2: A robust distributed architecture for scalable RAS , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[10]  Bert J. Debusschere,et al.  Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[11]  Ron A. Oldfield Lightweight storage and overlay networks for fault tolerance. , 2006 .