MPI-Mitten: Enabling Migration Technology in MPI

Group communications are commonly used in parallel and distributed environment. However, existing migration mechanisms do not support group communications. This weakness prevents migrationbased proactive fault tolerance, among others, to be applied to MPI applications. In this study, we propose distributed migration protocols with group membership management to support process migration with group changing. We design and implement a process migration enabling MPI library, named MPIMitten, to verify the protocols and enhance current MPI platforms for reliability and usability. MPI-Mitten is based on MPI standard and can be applied to any MPI-2 implementations. Experimental results show the proposed distributed process migration protocols are solid and the MPI-Mitten system is effective and is uniquely supporting migration-based fault tolerance.

[1]  Greg Hamerly,et al.  Bayesian approaches to failure prediction for disk drives , 2001, ICML.

[2]  Jason Nieh,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation , 2022 .

[3]  Anthony Skjellum,et al.  MPI/FT/sup TM/: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[4]  Roy Friedman,et al.  Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[5]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[6]  Mark S. Squillante,et al.  Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[7]  B. Bouteiller,et al.  MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[8]  Peter Smith,et al.  Heterogeneous process migration: the Tui system , 1998, Softw. Pract. Exp..

[9]  Kai Li,et al.  Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.

[10]  Jack J. Dongarra,et al.  Scalable Fault Tolerant MPI: Extending the Recovery Algorithm , 2005, PVM/MPI.

[11]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[12]  David F. Heidel,et al.  An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[13]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[14]  Cong Du,et al.  Runtime system for autonomic rescheduling of MPI programs , 2004, International Conference on Parallel Processing, 2004. ICPP 2004..

[15]  Cong Du,et al.  HPCM: a pre-compiler aided middleware for the mobility of legacy code , 2003, 2003 Proceedings IEEE International Conference on Cluster Computing.

[16]  Mike Hibler,et al.  User-level checkpointing through exportable kernel state , 1996, Proceedings of the Fifth International Workshop on Object-Orientation in Operation Systems.

[17]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[18]  Xiaohu Sun,et al.  Runtime system for autonomic rescheduling of MPI programs , 2004 .

[19]  Marvin Theimer,et al.  Heterogeneous process migration by recompilation , 1991, [1991] Proceedings. 11th International Conference on Distributed Computing Systems.

[20]  Xian-He Sun,et al.  Communication state transfer for the mobility of concurrent heterogeneous computing , 2001, IEEE Transactions on Computers.

[21]  Thomas Hérault,et al.  Hybrid preemptive scheduling of MPI applications on the grids , 2004, Fifth IEEE/ACM International Workshop on Grid Computing.

[22]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).