Achieving Fault Tolerance on Grids with the CPPC Framework and the GridWay Metascheduler

Grids have brought a significant increase in the number of resources available to applications. Over the last decade, considerable effort has gone into developing middleware that provides grids with application-execution functionality. However, support for fault-tolerant execution is either lacking or limited. This paper describes an experience in adding fault tolerance support to parallel executions on grids by integrating CPPC, a checkpointing tool for parallel applications, with GridWay, a well-known metascheduler provided with the Globus Toolkit. Since the two tools are not immediately compatible, a new architecture, called CPPC-GW, has been designed and implemented to allow the transparent execution of CPPC applications through GridWay. The performance of the solution has been evaluated using the NAS Parallel Benchmarks. Detailed experimental results show the low overhead of the approach.
