FIMD-MPI: a tool for injecting faults into MPI applications

Parallel computing is seeing increasing use in critical applications. The need therefore arises to test the robustness of parallel applications in the presence of exceptional conditions, or faults. Communication-software-based fault injection is an extremely flexible approach to robustness testing in message-passing parallel computers. A fault injection methodology and tool that use this approach are presented. The tool, known as FIMD-MPI, allows injection of faults into MPI-based applications. The structure and operation of FIMD-MPI are described and the use of the tool is illustrated on an example fault-tolerant MPI application.
