Online Tuning of EASY-Backfilling using Queue Reordering Policies

The EASY-FCFS heuristic is the basic building block of job scheduling policies on most parallel High Performance Computing platforms. Despite its simplicity and its guarantee of no job starvation, it can still be improved on a per-system basis. Such tuning is difficult because of non-linearities in the scheduling process. This paper studies an online approach to the automatic tuning of the EASY heuristic for HPC platforms. More precisely, we consider the problem of selecting a reordering policy for the job queue under several feedback modes. Through a comprehensive experimental validation on actual logs, we show that periodic simulation of historical data can recover existing in-hindsight results, which cut the average waiting time almost in half. This result holds even when the simulator results are noisy. Moreover, we show that good performance can still be obtained without a simulator, under what is called bandit feedback: only the performance of the algorithm that was picked on the live system can be observed. Indeed, a simple multi-armed bandit algorithm can reduce the average waiting time by 40 percent.
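
To make the bandit feedback setting concrete, the following is a minimal sketch of an epsilon-greedy selector over a set of queue reordering policies. It is an illustration under stated assumptions, not the algorithm evaluated in the paper: the abstract does not name the bandit algorithm used, and the class name, the policy labels other than FCFS, the value of epsilon, and the per-period waiting times are all hypothetical.

```python
import random


class EpsilonGreedyPolicySelector:
    """Epsilon-greedy bandit over queue reordering policies (illustrative sketch)."""

    def __init__(self, policies, epsilon=0.1, seed=None):
        self.policies = list(policies)
        self.epsilon = epsilon              # exploration probability (assumed value)
        self.rng = random.Random(seed)
        self.counts = {p: 0 for p in self.policies}
        self.mean_wait = {p: 0.0 for p in self.policies}

    def select(self):
        """Pick the policy to run on the live system for the next period."""
        # Try every policy once before exploiting the estimates.
        untried = [p for p in self.policies if self.counts[p] == 0]
        if untried:
            return untried[0]
        # Explore uniformly at random with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.policies)
        # Otherwise exploit: lowest observed average waiting time so far.
        return min(self.policies, key=lambda p: self.mean_wait[p])

    def update(self, policy, observed_wait):
        """Bandit feedback: only the policy that actually ran is updated."""
        self.counts[policy] += 1
        n = self.counts[policy]
        # Incremental update of the running mean.
        self.mean_wait[policy] += (observed_wait - self.mean_wait[policy]) / n


def simulate(selector, wait_by_policy, periods):
    """Toy loop: each period, run the chosen policy and observe its waiting time."""
    for _ in range(periods):
        policy = selector.select()
        selector.update(policy, wait_by_policy[policy])
    return selector


if __name__ == "__main__":
    # Hypothetical, noise-free waiting times per policy, for illustration only.
    toy_waits = {"FCFS": 100.0, "SPF": 60.0, "LQF": 80.0}
    sel = simulate(EpsilonGreedyPolicySelector(toy_waits, seed=0), toy_waits, 200)
    print(sel.counts)
```

In a real deployment, the value passed to `update` would be the average waiting time measured on the live system over the last period, which is the only signal available under bandit feedback. With a simulator, every policy's estimate could instead be refreshed each period from replayed historical data.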
