Transparent Message-Passing Parallel Applications Checkpointing in Kerrighed

Nowadays, clusters are widely used to execute scientific applications. These applications are often long-running message-passing parallel applications. As the number of nodes in clusters grows, the probability of a node failure during the execution of an application increases, and the application execution time may exceed the cluster mean time between failures (MTBF). To avoid restarting applications from the beginning after a failure, fault-tolerance mechanisms such as checkpoint/restart are needed. Currently, checkpoint/restart mechanisms are either implemented directly in the application source code by application programmers or integrated into communication environments such as MPI or PVM. In this paper we propose a new approach in which checkpoint/restart mechanisms for parallel applications are implemented in a cluster single system image operating system. While this kernel-level approach is more complex to implement than the others, it is more general because it requires no modification, recompilation, or relinking of the applications, regardless of the communication environment they rely on. Our approach has been implemented in the Kerrighed single system image operating system, which is based on Linux. Performance results are presented in this paper.
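The key requirement of checkpoint/restart for a message-passing application is that the saved global state be consistent: every process's state must be captured together with the messages still in transit between processes, or those messages are lost on restart. The following is a minimal illustrative sketch of that idea in Python. All names here (`Process`, `coordinated_checkpoint`, `restart`) are hypothetical simulation constructs, not Kerrighed interfaces; the mechanism described in the paper operates transparently at the kernel level, with no such code in the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Process:
    """A toy model of one application process: local state plus an
    inbox holding messages sent to it but not yet consumed."""
    pid: int
    state: dict = field(default_factory=dict)
    inbox: List[str] = field(default_factory=list)

def coordinated_checkpoint(procs: List[Process]) -> Dict[int, dict]:
    """Snapshot every process together with its in-transit messages.
    Capturing channel contents is what makes the global state consistent."""
    return {p.pid: {"state": dict(p.state), "inbox": list(p.inbox)}
            for p in procs}

def restart(procs: List[Process], checkpoint: Dict[int, dict]) -> None:
    """Roll every process back to the saved consistent global state."""
    for p in procs:
        saved = checkpoint[p.pid]
        p.state = dict(saved["state"])
        p.inbox = list(saved["inbox"])

# Example: checkpoint while a message is in flight, then recover.
a, b = Process(0), Process(1)
a.state["step"] = 5
b.inbox.append("partial-sum from 0")   # message in transit at checkpoint time
ckpt = coordinated_checkpoint([a, b])

a.state["step"] = 9                    # further progress...
b.inbox.clear()                        # ...then a failure destroys it

restart([a, b], ckpt)
print(a.state["step"], b.inbox)        # back to the consistent snapshot
```

In the kernel-level approach advocated by the paper, the equivalent of this bookkeeping (process state, open communication channels, in-flight data) is done by the operating system, which is why applications need no modification or relinking.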
