The Impact of More Accurate Requested Runtimes on Production Job Scheduling Performance

The question of whether more accurate requested runtimes can significantly improve production parallel system performance has previously been studied for the FCFS-backfill scheduler, using a limited set of system performance measures. This paper examines the question for higher performance backfill policies, heavier system loads as are observed in current leading edge production systems such as the large Origin 2000 system at NCSA, and a broader range of system performance measures. The new results show that more accurate requested runtimes can improve system performance much more significantly than suggested in previous results. For example, average slowdown decreases by a factor of two to six, depending on system load and the fraction of jobs that have the more accurate requests. The new results also show that (a) nearly all of the performance improvement is realized even if the more accurate runtime requests are a factor of two higher than the actual runtimes, (b) most of the performance improvement is achieved when test runs are used to obtain more accurate runtime requests, and (c) in systems where only a fraction (e.g., 60%) of the jobs provide approximately accurate runtime requests, the users that provide the approximately accurate requests achieve even greater improvements in performance, such as an order of magnitude improvement in average slowdown for jobs that have runtime up to fifty hours.

[1]  Dmitry N. Zotkin,et al.  Job-length estimation and performance in backfilling schedulers , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[2]  Dror G. Feitelson,et al.  Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling , 2001, IEEE Trans. Parallel Distributed Syst..

[3]  Anand Sivasubramaniam,et al.  Improving parallel job scheduling by combining gang scheduling and backfilling techniques , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[4]  Honbo Zhou,et al.  The EASY - LoadLeveler API Project , 1996, JSSPP.

[5]  Mary K. Vernon,et al.  Characteristics of a Large Shared Memory Production Workload , 2001, JSSPP.

[6]  Warren Smith,et al.  Using Run-Time Predictions to Estimate Queue Wait Times and Improve Scheduler Performance , 1999, JSSPP.

[7]  Mary K. Vernon,et al.  Production job scheduling for parallel shared memory systems , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[8]  Peter J. Keleher,et al.  Randomization, Speculation, and Adaptation in Batch Schedulers , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[9]  Dror G. Feitelson,et al.  Utilization and Predictability in Scheduling the IBM SP2 with Backfilling , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[10]  Francine Berman,et al.  A comprehensive model of the supercomputer workload , 2001 .

[11]  Richard Gibbons,et al.  A Historical Application Profiler for Use by Parallel Schedulers , 1997, JSSPP.

[12]  Anand Sivasubramaniam,et al.  An analysis of space-and time-sharing techniques for parallel job scheduling , 2001 .

[13]  David A. Lifka,et al.  The ANL/IBM SP Scheduling System , 1995, JSSPP.