FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study

There is growing interest in deploying MPI over very large numbers of heterogeneous, geographically distributed resources. FT-MPI provides the fault tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative-domain interoperability and “pluggability” to FT-MPI. The latter feature allows us, using proxies, to transparently replace one vulnerable module, its name service, with fault-tolerant replacements. We present an algorithm for improving the performance of operations over the proxies. We evaluate its performance in a comparison using the original name service, OpenLDAP, and HDNS, a current Emory research project.
