Evaluating the reliability of computational grids from the end user's point of view

Reliability, in terms of Grid component fault tolerance and minimum quality of service, is an important aspect that has to be addressed to foster the adoption of Grid technology. Software reliability is critically important in today's integrated and distributed systems, as it is often the weak link in system performance. In general, reliability is difficult to measure, especially in Grid environments, where evaluation methodologies are still novel and controversial. This paper describes a straightforward procedure to analyze the reliability of computational grids from the viewpoint of an end user. The procedure is illustrated by evaluating a research Grid infrastructure based on Globus basic services and the GridWay meta-scheduler. The GridWay support for fault tolerance is also demonstrated in a production-level environment. Results show that GridWay is a reliable workload management tool for dynamic and faulty Grid environments. Transparently to the end user, GridWay is able to detect and recover from any of the Grid element failure, outage, and saturation conditions specified by the reliability analysis procedure.
