论文信息 - MPI/FT/sup TM/: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing

MPI/FT/sup TM/: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing

MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing and scalable clusters. MPI/FT, the system described in the paper, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features. Non-portable code for event handlers and recovery management is isolated. User-coordinated recovery, checkpointing, transparency and event handling, as well as evolvability of legacy MPI codes form key design criteria. Parallel self-checking threads address four levels of MPI implementation robustness, three of which are portable to any multithreaded MPI. A taxonomy of application types provides six initial fault-relevant models; user-transparent parallel nMR computation is thereby considered. Key concepts from MPI/RT-real-time MPI-are also incorporated into MPI/FT, with further overt support for MPI/RT and MPI/FT in applications possible in future.

[1] Richard D. Schlichting,et al. Fail-stop processors: an approach to designing fault-tolerant computing systems , 1983, TOCS.

[2] Niraj K. Jha,et al. Algorithm-Based Fault Tolerance for FFT Networks , 1994, IEEE Trans. Computers.

[3] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[4] Anthony Skjellum,et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..

[5] Miron Livny,et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .

[6] Mark Garland Hayden,et al. The Ensemble System , 1998 .

[7] Jack J. Dongarra,et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.

[8] Greg Burns,et al. LAM: An Open Cluster Environment for MPI , 2002 .