VRM: A Failure-Aware Grid Resource Management System

For resource management in grid environments, advance reservations have proven very useful and are therefore supported by a variety of grid toolkits. However, failure recovery for such systems has not yet received the attention it deserves. In this paper, we address the problem of remapping reservations to other resources when the originally selected resource fails. Instead of dealing with jobs that are already running, which usually requires checkpointing and migration, we focus on jobs that are scheduled on the failed resource for a specific future period but have not yet started. The most critical factor in solving this problem is the estimation of the downtime. We avoid the drawbacks of under- or overestimating the downtime with a dynamic load-based approach. Extensive simulations in a grid environment show that this approach outperforms estimation-based approaches.
