Application-Level Fault-Tolerance Solutions for Grid Computing

One of the key functionalities provided by Grid systems is the remote execution of applications. This paper introduces a research proposal on fault-tolerance mechanisms for the execution of sequential and message-passing parallel applications on the Grid. A service-based architecture called CPPC-G is proposed. The CPPC (Controller/Precompiler for Portable Checkpointing) framework is used to insert checkpointing instrumentation into the application code. CPPC-G services will be in charge of submitting and monitoring application executions, managing the checkpoint files generated by CPPC-enabled applications, and detecting and automatically restarting failed executions. Developing the CPPC-G architecture will involve research in several areas: storage and management of data files (checkpoint files); automatic selection of suitable computing resources; reliable detection of execution failures; and robustness issues, so that the architecture is itself fault-tolerant.
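The division of labour described above (an application that writes checkpoints, and a service that monitors it and restarts it after a failure) can be illustrated with a minimal, single-machine sketch. This is not the CPPC or CPPC-G API: `run_job`, `supervise`, `SimulatedFailure`, and the JSON checkpoint format are all hypothetical names chosen for illustration, and a raised exception stands in for the failure of a remote Grid resource.

```python
import json
import os
import tempfile


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a crashed or lost job on a Grid resource."""


def run_job(checkpoint_path, total_steps, fail_at=None):
    """Sum 0..total_steps-1, saving a checkpoint after every step.

    If a checkpoint file exists, execution resumes from it instead of
    starting over. `fail_at` injects a failure at the given step.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise SimulatedFailure(f"crashed at step {fail_at}")
        state["acc"] += state["step"]
        state["step"] += 1
        # Write to a temporary file and rename it, so a crash during the
        # write never leaves a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["acc"]


def supervise(checkpoint_path, total_steps, failures, max_restarts=5):
    """Submit the job, detect failures, and restart from the last checkpoint.

    `failures` lists the steps at which successive attempts should fail.
    Returns (result, number_of_restarts).
    """
    pending = list(failures)
    restarts = 0
    while True:
        fail_at = pending.pop(0) if pending else None
        try:
            return run_job(checkpoint_path, total_steps, fail_at), restarts
        except SimulatedFailure:
            if restarts >= max_restarts:
                raise
            restarts += 1


def demo():
    """Run a 10-step job that fails twice before completing."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "job.ckpt")
        return supervise(path, total_steps=10, failures=[3, 7])
```

In this sketch, `demo()` completes the job after two restarts without repeating any step that was already checkpointed. A Grid deployment would additionally have to store checkpoint files where a different resource can reach them, choose that resource, and detect failures remotely, which are the research areas the abstract lists.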
