Fault-tolerant Grid Resource Management Infrastructure

The main motivation for existing Grid systems is to provide mechanisms for sharing and accessing large and heterogeneous collections of remote resources. This remains the primary goal even today. However, achieving large-scale distributed computing in a seamless manner on Grid systems introduces not only the problems of efficient utilization and satisfactory response time but also the problem of fault tolerance. As Grid computing gains momentum, the ability to tolerate failures while effectively exploiting Grid resources in a scalable and transparent manner must be an integral part of the Grid computing infrastructure. In this paper, we present a reconfigurable multi-layered Grid infrastructure that provides fault-tolerance mechanisms to ensure that a Grid client can obtain reliable services even if the middleware service that provides them suffers crash failures.
