Analyzing and adjusting user runtime estimates to improve job scheduling on the Blue Gene/P

Backfilling and short-job-first are widely acknowledged enhancements to the simple but popular first-come, first-served job scheduling policy. However, both enhancements depend on user-provided estimates of job runtime, which research has repeatedly shown to be inaccurate. We have investigated the effects of this inaccuracy on backfilling and on different queue prioritization policies to determine which part of the scheduling policy is most sensitive to it. Using these results, we have designed and implemented several estimate-adjusting schemes based on historical data. We have evaluated these schemes using workload traces from the Blue Gene/P system at Argonne National Laboratory. Our experimental results demonstrate that dynamically adjusting job runtime estimates can improve job scheduling performance by up to 20%.
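
To make the idea of adjusting estimates from historical data concrete, here is a minimal sketch in Python. It is not one of the schemes evaluated in this work; it only illustrates the general approach under stated assumptions. The class name `EstimateAdjuster`, the per-user grouping, the sliding window of five jobs, and the use of a plain mean are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque


class EstimateAdjuster:
    """Scale user walltime requests by each user's recent accuracy ratio.

    Illustrative only: grouping by user and averaging the last `window`
    ratios are assumptions, not the schemes described in the abstract.
    """

    def __init__(self, window=5):
        # Keep only the most recent `window` ratios per user.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, user, requested, actual):
        """Store actual/requested for a finished job (both in seconds)."""
        if requested <= 0:
            raise ValueError("requested walltime must be positive")
        # Assumes jobs are killed at their requested walltime,
        # so the ratio is capped at 1.0.
        self.history[user].append(min(actual / requested, 1.0))

    def adjust(self, user, requested):
        """Return the adjusted estimate, or the request if no history exists."""
        ratios = self.history.get(user)
        if not ratios:
            return requested
        return requested * sum(ratios) / len(ratios)


adjuster = EstimateAdjuster()
adjuster.record("alice", requested=3600, actual=900)   # ratio 0.25
adjuster.record("alice", requested=3600, actual=1800)  # ratio 0.5
print(adjuster.adjust("alice", 7200))  # mean ratio 0.375 applied to 7200 s
print(adjuster.adjust("bob", 7200))    # no history, request is unchanged
```

A scheduler would use the adjusted value only for queue ordering and backfill decisions. The user's original request would still serve as the kill limit, so an underestimate cannot terminate a job early.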
