Rescheduling for reliable job completion with the support of clouds

A major performance issue in large-scale decentralized distributed systems, such as grids, is how to ensure that jobs finish within their estimated completion times despite fluctuations in resource performance. Several techniques, including advance reservation, rescheduling and migration, have previously been adopted to resolve or mitigate this issue; however, each faces non-negligible practical hurdles. Clouds may be an attractive alternative, since cloud resources are much more reliable than grid resources. This paper investigates how effectively rescheduling with cloud resources increases the reliability of job completion. Specifically, schedules are initially generated using grid resources only, and the relatively costlier cloud resources are used only for rescheduling, to cope with a delay in job completion. A job in our study is a bag-of-tasks (BoT) application consisting of a large number of independent tasks; this job model is common in many science and engineering applications. We have devised a novel rescheduling technique, called rescheduling using clouds for reliable completion (RC^2), and applied it to three well-known existing heuristics. Our experimental results reveal that RC^2 significantly reduces delay in job completion.
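The two-phase idea described above (schedule on grid resources first, fall back to cloud resources only when a task is running late) can be illustrated with a minimal sketch. This is not the RC^2 algorithm itself, whose details are not given here. The greedy minimum-completion-time placement, the `tolerance` parameter, the assumption that a cloud instance is always available, and all names (`Task`, `initial_schedule`, `reschedule_with_cloud`) are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    work: float  # abstract units of computation (assumed, for illustration)


def initial_schedule(tasks, grid_speeds):
    """Place each task on the grid resource that finishes it earliest.

    Returns {task name: (grid resource index, estimated completion time)}.
    Only grid resources are considered in this phase.
    """
    ready = [0.0] * len(grid_speeds)
    plan = {}
    for t in tasks:
        r = min(range(len(grid_speeds)),
                key=lambda i: ready[i] + t.work / grid_speeds[i])
        ready[r] += t.work / grid_speeds[r]
        plan[t.name] = (r, ready[r])
    return plan


def reschedule_with_cloud(tasks, plan, observed_speeds, cloud_speed,
                          tolerance=0.0):
    """Move a task to the cloud only if it would miss its estimate.

    observed_speeds holds the speeds the grid resources actually deliver.
    Assumption: a cloud instance is available whenever one is needed.
    Returns {task name: (placement label, projected completion time)}.
    """
    ready = [0.0] * len(observed_speeds)
    result = {}
    for t in tasks:
        r, estimate = plan[t.name]
        projected = ready[r] + t.work / observed_speeds[r]
        if projected > estimate * (1.0 + tolerance):
            # Task is late on the grid: run it on a (costlier) cloud instance
            # starting when the delay is detected; the grid slot stays free.
            result[t.name] = ("cloud", ready[r] + t.work / cloud_speed)
        else:
            ready[r] = projected
            result[t.name] = (f"grid:{r}", projected)
    return result
```

For example, with four equal tasks on two grid resources, if the second resource delivers only half of its expected speed, only the task that would miss its estimate is moved to the cloud; the remaining tasks stay on the grid, which keeps cloud usage (and cost) low.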
