Applying semantics to grid middleware

Multi-site parallel job schedulers can improve average job turnaround time by making use of fragmented node resources available throughout the grid. By mapping jobs across potentially many clusters, these schedulers allow jobs that would otherwise wait in the queue for local resources to begin execution much earlier, thereby improving system utilization and reducing average queue waiting time. Recent research in this area of scheduling leverages user-provided estimates of job communication characteristics to partition the job more effectively across system resources. In this paper, we address the impact of inaccuracies in these estimates on system performance and show that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are instances where these errors result in poor job scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application performance and turnaround time. Consequently, we explore the use of job checkpointing, termination, migration, and restart (CTMR) to selectively stop offending jobs to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions under which, and the extent to which, CTMR improves overall performance. We demonstrate that this technique is beneficial even when its overhead is high. Copyright © 2009 John Wiley & Sons, Ltd. This is an extended version of ‘Using Checkpointing to Recover from Poor Multi-site Parallel Job Scheduling Decisions’ that appeared in the 5th International Workshop on Middleware for Grid Computing, held in conjunction with the 2007 ACM-IFIP-USENIX 8th International Middleware Conference, 26 November 2007 [1].

[1] Anca I. D. Bucur, et al. The Performance of Processor Co-Allocation in Multicluster Systems, 2003, CCGRID.

[2] Johan Vounckx, et al. Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback, 1993.

[3] Mark J. Clement, et al. Core Algorithms of the Maui Scheduler, 2001, JSSPP.

[4] Richard Koo, et al. Checkpointing and Rollback-Recovery for Distributed Systems, 1986, IEEE Transactions on Software Engineering.

[5] William M. Jones. Using checkpointing to recover from poor multi-site parallel job scheduling decisions, 2007, MGC '07.

[6] Dror G. Feitelson, et al. Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC/860, 1995, JSSPP.

[7] James S. Plank, et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, 2001, J. Parallel Distributed Comput.

[8] Dror G. Feitelson, et al. Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, 2001, IEEE Trans. Parallel Distributed Syst.

[9] Lei Gao, et al. Multisite co-allocation scheduling algorithms for parallel jobs in computing grid environments, 2006, Science in China Series F: Information Sciences.

[10] Cynthia Bailey Lee, et al. Are User Runtime Estimates Inherently Inaccurate?, 2004, JSSPP.

[11] William M. Jones. The impact of error in user-provided bandwidth estimates on multi-site parallel job scheduling performance, 2007.

[12] John Ngubiri, et al. Group-Wise Performance Evaluation of Processor Co-allocation in Multi-cluster Systems, 2007, JSSPP.

[13] Dan Tsafrir, et al. Backfilling Using System-Generated Predictions Rather than User Runtime Estimates, 2007, IEEE Transactions on Parallel and Distributed Systems.

[14] Daniel C. Stanzione, et al. Characterization of Bandwidth-Aware Meta-Schedulers for Co-Allocating Jobs Across Multiple Clusters, 2005, The Journal of Supercomputing.

[15] John T. Daly, et al. A higher order estimate of the optimum checkpoint interval for restart dumps, 2006, Future Gener. Comput. Syst.

[16] Meeta Sharma Gupta, et al. Performance implications of periodic checkpointing on large-scale cluster systems, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[17] Achim Streit, et al. Enhanced Algorithms for Multi-site Scheduling, 2002, GRID.

[18] John T. Daly, et al. Application Resilience: Making Progress in Spite of Failure, 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[19] Andrea C. Arpaci-Dusseau, et al. The Impact of More Accurate Requested Runtimes on Production Job Scheduling Performance, 2002, JSSPP.

[20] Jinhui Qin, et al. An Improved Job Co-Allocation Strategy in Multiple HPC Clusters, 2007, 21st International Symposium on High Performance Computing Systems and Applications (HPCS'07).