A scalable runtime fault detection mechanism for high performance computing

Fault detection is the process of deducing the exact source of an application failure from a set of observed symptoms. It has become increasingly challenging for high performance computing (HPC) applications that use the Message Passing Interface (MPI). Various runtime fault detection techniques, such as Marmot, Umpire and ISP, have been used for MPI applications. However, as MPI applications scale out, their complexity increases accordingly, and the effectiveness of the existing runtime fault detection techniques deteriorates rapidly. In this context, we propose a scalable runtime fault detection mechanism, SRFD, which provides a distributed, lightweight service using tree-based fault detection algorithms at runtime. In essence, SRFD serves as a fault detection engine within message-passing libraries: it logically organizes all application processes into a tree topology and designs the fault report and analysis algorithms accordingly. We present the details of the SRFD mechanism, including the implementation of the fault report and analysis algorithms, and we develop a prototype of the fault detection engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 24 computing nodes, demonstrates the capability of SRFD to detect common faults such as deadlock, invalid arguments and type-matching errors.
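
As an illustration of the tree-topology idea only, the following minimal sketch maps process ranks onto a logical k-ary tree rooted at rank 0 and computes the path a fault report would follow from a process up to the root. The abstract does not specify the tree layout, the fan-out, or the report format used by SRFD, so the heap-style layout, the `fanout` parameter, and the function names `parent`, `children`, and `report_path` are all assumptions made for this example.

```python
from typing import List, Optional


def parent(rank: int, fanout: int = 2) -> Optional[int]:
    """Parent of `rank` in a k-ary tree rooted at rank 0 (None for the root).

    Assumed heap-style layout; not taken from the SRFD paper.
    """
    return None if rank == 0 else (rank - 1) // fanout


def children(rank: int, size: int, fanout: int = 2) -> List[int]:
    """Children of `rank` among `size` processes in the same assumed layout."""
    first = fanout * rank + 1
    return [c for c in range(first, first + fanout) if c < size]


def report_path(rank: int, fanout: int = 2) -> List[int]:
    """Ranks a hypothetical fault report visits from `rank` up to the root."""
    path = [rank]
    while (p := parent(path[-1], fanout)) is not None:
        path.append(p)
    return path
```

With 24 processes and a fan-out of 4 in this assumed layout, `report_path(23, fanout=4)` yields `[23, 5, 1, 0]`, so a report reaches the root in three hops rather than requiring every process to contact a single central collector directly.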