Paper information - A scalable runtime fault detection mechanism for high performance computing


Fault detection is the process of deducing the exact source of an application failure from a set of observed symptoms. It has become increasingly challenging for high performance computing (HPC) applications that use the Message Passing Interface (MPI). Various runtime fault detection tools, such as Marmot, Umpire and ISP, have been used for MPI applications. However, as MPI applications scale out, their complexity increases proportionally, and the effectiveness of the existing runtime fault detection techniques deteriorates rapidly. In this context, we propose a scalable runtime fault detection mechanism, namely SRFD, which provides a distributed, lightweight service using tree-based fault detection algorithms at runtime. In essence, SRFD serves as a fault detection engine within message-passing libraries: it logically organizes all application processes into a tree topology and uses fault report and analysis algorithms designed specifically for that topology. We present details of the SRFD mechanism, including the implementation of the fault report and analysis algorithms. Further, we develop a prototype of the fault detection engine within MVAPICH2. An experimental evaluation on a typical HPC cluster with 24 computing nodes demonstrates the capability of SRFD to detect common faults such as deadlocks, invalid arguments and type-matching errors.
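The abstract does not give the actual SRFD algorithms, so the following is only a minimal sketch of the general idea of a tree-based fault report: processes are logically arranged into a tree and each process forwards the reports from its subtree to its parent until they reach the root. The k-ary layout rooted at rank 0, the function names and the report format are all assumptions made for illustration; none of them come from the paper or from MVAPICH2.

```python
def parent(rank, arity=2):
    """Parent of `rank` in a k-ary tree rooted at rank 0 (None for the root).

    Assumed layout: the children of rank r are r*arity+1 .. r*arity+arity.
    """
    return None if rank == 0 else (rank - 1) // arity


def children(rank, size, arity=2):
    """Children of `rank` among `size` processes, under the same assumed layout."""
    first = rank * arity + 1
    return [c for c in range(first, first + arity) if c < size]


def report_path(rank, arity=2):
    """Ranks a fault report passes through on its way from `rank` to the root."""
    path = [rank]
    while parent(path[-1], arity) is not None:
        path.append(parent(path[-1], arity))
    return path


def aggregate_reports(local_faults, size, arity=2):
    """Merge per-rank fault reports bottom-up and return what the root collects.

    `local_faults` maps a rank to the list of fault labels it observed locally.
    Each collected entry is a (origin_rank, fault_label) pair.
    """
    merged = {r: [(r, f) for f in local_faults.get(r, [])] for r in range(size)}
    # In this layout every child has a higher rank than its parent, so walking
    # ranks from high to low guarantees a subtree is complete before it is forwarded.
    for rank in range(size - 1, 0, -1):
        merged[parent(rank, arity)].extend(merged[rank])
    return merged[0]


if __name__ == "__main__":
    # Hypothetical scenario: 7 processes, two of which observe a fault.
    faults = {5: ["deadlock"], 3: ["invalid argument"]}
    print(report_path(5))
    print(sorted(aggregate_reports(faults, size=7)))
```

With a tree of this kind, a report travels through a number of hops that grows logarithmically with the process count, which is the usual motivation for tree topologies in scalable runtime services; whether SRFD uses this exact layout is not stated in the abstract.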

Jian Gao | Peng Qing | Kang Yu
