The Dynamics of Backfilling: Solving the Mystery of Why Increased Inaccuracy May Help

Parallel job scheduling with backfilling requires users to provide runtime estimates, which the scheduler uses to pack jobs more tightly. Studies of the impact of such estimates on performance have modeled them using a "badness factor" f ≥ 0 in an attempt to capture their inaccuracy (given a runtime r, the estimate is uniformly distributed in [r, (f + 1) · r]). Surprisingly, inaccurate estimates (f > 0) yielded better performance than accurate ones (f = 0). We explain this by a "heel-and-toe" dynamic that, with f > 0, causes backfilling to approximate shortest-job-first scheduling. We further find that the effect of systematically increasing f is V-shaped: average wait time and slowdown initially drop, only to rise later on. This happens because higher values of f create bigger "holes" in the schedule (so longer jobs can backfill) and increase the randomness (more long jobs appear short), thus overshadowing the initial heel-and-toe preference for shorter jobs. The bottom line is that artificial inaccuracy generated by multiplying (real or perfect) estimates by a factor (1) is just a scheduling technique that trades off fairness for performance, and (2) is ill-suited for studying the effect of real inaccuracy. Real estimates are modal (90% of the jobs use the same 20 estimates) and bounded by a maximum (usually the most popular estimate). Therefore, when performing an evaluation, "increased inaccuracy" should translate to increased modality. Unlike multiplying, this indeed worsens performance, as one would intuitively expect.
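The badness-factor model described above can be sketched in a few lines. The snippet below is only an illustration under stated assumptions, not the simulation used in the study: the function names `inflate_estimate` and `inversion_rate`, the trial count, and the example runtimes are all hypothetical. It draws an estimate uniformly from [r, (f + 1) · r] and measures how often a longer job ends up with a smaller estimate than a shorter one, which is one way to see the "more long jobs appear short" effect grow with f.

```python
import random


def inflate_estimate(runtime, f, rng):
    """Draw an estimate uniformly from [runtime, (f + 1) * runtime]."""
    if f < 0:
        raise ValueError("badness factor f must be >= 0")
    return rng.uniform(runtime, (f + 1) * runtime)


def inversion_rate(short, long, f, trials=10_000, seed=0):
    """Fraction of trials in which the longer job receives the smaller estimate.

    Hypothetical helper: 'short' and 'long' are the real runtimes of two jobs.
    """
    rng = random.Random(seed)
    inverted = 0
    for _ in range(trials):
        long_estimate = inflate_estimate(long, f, rng)
        short_estimate = inflate_estimate(short, f, rng)
        if long_estimate < short_estimate:
            inverted += 1
    return inverted / trials


if __name__ == "__main__":
    # Two jobs with real runtimes of 100 and 200 time units (made-up values).
    for f in (0, 0.5, 1, 4, 9):
        print(f, inversion_rate(100, 200, f))
```

With f = 0 every estimate equals the real runtime, so the two jobs are never misordered. For a pair of runtimes 100 and 200, the estimate ranges do not overlap until f exceeds 1, after which the fraction of misordered pairs rises with f.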

[38]  Sangsuree Vasupongayya,et al.  Search-based Job Scheduling for Parallel Computer Workloads , 2005, 2005 IEEE International Conference on Cluster Computing.