Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI

Fault detection and prediction in HPC clusters and cloud-computing systems are increasingly challenging problems. Several system middleware components, such as job schedulers and MPI implementations, support both reactive and proactive mechanisms to tolerate faults. These techniques rely on external components, such as system logs and infrastructure monitors, to supply information about hardware and software failures, either as detections or as predictions. However, these middleware components work in isolation, without disseminating the knowledge of faults they encounter. In this context, we propose FTB-IPMI, a lightweight multi-threaded service that provides distributed fault monitoring using the Intelligent Platform Management Interface (IPMI) and coordinated propagation of fault information using the Fault-Tolerance Backplane (FTB). In essence, it serves as an intermediary between system hardware and the software stack, translating raw hardware events into structured software events and delivering them to any interested component through a publish-subscribe framework. Fault predictors and other decision-making engines that rely on distributed failure information can use FTB-IPMI to facilitate proactive fault-tolerance mechanisms such as preemptive job migration. To demonstrate this capability, we have developed a fault-prediction engine within MVAPICH2, an RDMA-based MPI implementation. Failure predictions made by this engine trigger the migration of processes from failing nodes to healthy spare nodes, thereby providing resilience to the MPI application. Experimental evaluation indicates that a single instance of FTB-IPMI can scale to several hundred nodes with a remarkably low resource-utilization footprint. A deployment of FTB-IPMI servicing a cluster of 128 compute nodes sweeps the entire cluster, collecting IPMI sensor readings for CPU temperatures, system voltages, and fan speeds, in about 0.75 seconds, and the service consumes an average of 0.35% CPU on the node that runs it.
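
The sweep-translate-publish loop described above can be illustrated with a short sketch. The Python code below is a minimal, hypothetical stand-in for the service, not the actual implementation (which is a multi-threaded C daemon built on the FTB client API): it polls raw IPMI sensor readings with the standard ipmitool utility, translates them into structured events, and delivers them to subscribers through an in-process publish-subscribe object. The names Backplane, FaultEvent, and poll_node, the hostnames and credentials, the sweep interval, and the 85-degree temperature threshold are all illustrative assumptions.

#!/usr/bin/env python3
"""Illustrative sketch of the FTB-IPMI monitoring loop. All names here
(FaultEvent, Backplane, poll_node) are hypothetical stand-ins; the real
service is a multi-threaded C daemon that publishes over the FTB."""

import subprocess
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FaultEvent:
    node: str        # host the reading came from
    sensor: str      # e.g. "CPU Temp", "Fan 1", "12V"
    reading: float   # current sensor value
    severity: str    # "INFO", or "WARN" once a threshold is crossed


class Backplane:
    """Minimal in-process publish-subscribe stand-in for the
    Fault-Tolerance Backplane."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[FaultEvent], None]] = []

    def subscribe(self, handler: Callable[[FaultEvent], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: FaultEvent) -> None:
        for handler in self._subscribers:
            handler(event)


def poll_node(host: str) -> List[FaultEvent]:
    """Read raw IPMI sensors via ipmitool and translate them into
    structured events. Credentials and threshold are placeholders."""
    out = subprocess.run(
        ["ipmitool", "-H", host, "-U", "admin", "-P", "password", "sensor"],
        capture_output=True, text=True, check=True).stdout
    events: List[FaultEvent] = []
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2:
            continue
        name, value = fields[0], fields[1]
        try:
            reading = float(value)
        except ValueError:
            continue  # skip discrete or unreadable sensors ("na")
        severity = "WARN" if "Temp" in name and reading > 85.0 else "INFO"
        events.append(FaultEvent(host, name, reading, severity))
    return events


if __name__ == "__main__":
    backplane = Backplane()
    # A fault predictor (e.g. inside the MPI library) would subscribe here.
    backplane.subscribe(lambda ev: print(
        f"[{ev.severity}] {ev.node}: {ev.sensor} = {ev.reading}"))
    nodes = ["node001", "node002"]       # hypothetical hostnames
    while True:
        for node in nodes:               # the real service sweeps nodes in parallel threads
            for ev in poll_node(node):
                backplane.publish(ev)
        time.sleep(60)                   # illustrative sweep interval

In this sketch the sweep is sequential; the reported 0.75-second sweep of 128 nodes relies on the service's parallel, multi-threaded design, and a real subscriber such as a fault-prediction engine would react to WARN events by initiating process migration rather than printing them.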
