A Service for Reliable Execution of Grid Applications

In grid environments, where application execution involves a large number of hardware and software components, the probability that at least one of these components is (temporarily) non-functional grows rapidly with the number of components. Traditional operating systems flag such failures as fatal and stop the application, relying on a restart after the problem has been fixed. In a large grid system, this approach is not feasible: failures happen too frequently, and error diagnosis might not be possible at all.
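
The growth of the failure probability can be illustrated with a simple worked formula. This is only a sketch under assumptions that the text itself does not state: each of the n components is taken to be non-functional independently and with the same probability p, and the numeric values are chosen purely for illustration.

```latex
% Assumption (not from the source): n independent components,
% each non-functional with the same probability p.
\[
  P(\text{at least one component is non-functional}) = 1 - (1 - p)^{n}
\]
% Illustrative values with p = 0.001:
\[
  n = 10:\quad 1 - 0.999^{10} \approx 0.01
  \qquad
  n = 1000:\quad 1 - 0.999^{1000} \approx 0.63
\]
```

Under these assumptions, a per-component failure probability that is negligible on a single machine becomes a better-than-even chance of some failure once a thousand components are involved, which is why stopping and restarting the whole application does not scale.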
