On the Scheduling of Checkpoints in Desktop Grids

Frequent resources failures are a major challenge for the rapid completion of batch jobs. Check pointing and migration is one approach to accelerate job completion avoiding deadlock. We study the problem of scheduling checkpoints of sequential jobs in the context of Desktop Grids, consisting of volunteered distributed resources. We craft a checkpoint scheduling algorithm that is provably optimal for discrete time when failures obey any general probability distribution. We show using simulations with parameters based on real-world systems that this optimal strategy scales and outperforms other strategies significantly in terms of check pointing costs and batch completion times.

[1]  Vijay S. Pande,et al.  Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problem , 2009, 0901.0866.

[2]  Alexandru Iosup,et al.  How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications , 2006, 2006 7th IEEE/ACM International Conference on Grid Computing.

[3]  Gilles Fedak,et al.  The Computational and Storage Potential of Volunteer Computing , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

[4]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[5]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[6]  David P. Anderson,et al.  Correlated Resource Models of Internet End Hosts , 2010, 2011 31st International Conference on Distributed Computing Systems.

[7]  David P. Anderson,et al.  BOINC: a system for public-resource computing and storage , 2004, Fifth IEEE/ACM International Workshop on Grid Computing.

[8]  Jean-Marc Vincent,et al.  Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home , 2011, IEEE Transactions on Parallel and Distributed Systems.

[9]  Henri Casanova,et al.  SimGrid: A Generic Framework for Large-Scale Distributed Experiments , 2008, Tenth International Conference on Computer Modeling and Simulation (uksim 2008).

[10]  Xiaola Lin,et al.  A Variational Calculus Approach to Optimal Checkpoint Placement , 2001, IEEE Trans. Computers.

[11]  Trilce Estrada,et al.  Modeling Job Lifespan Delays in Volunteer Computing Projects , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.

[12]  Alexandru Iosup,et al.  The Grid Workloads Archive , 2008, Future Gener. Comput. Syst..

[13]  Gilles Fedak,et al.  XtremWeb: a generic global computing system , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[14]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[15]  W YoungJohn A first order approximation to the optimum checkpoint interval , 1974 .

[16]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..