One Can Only Gain by Replacing EASY Backfilling: A Simple Scheduling Policies Case Study

High-Performance Computing (HPC) platforms are growing in size and complexity. To improve the quality of service of such platforms, researchers devote considerable effort to devising algorithms and techniques that improve different aspects of performance, such as energy consumption, total usage of the platform, and fairness between users. In spite of this, system administrators are always reluctant to deploy state-of-the-art scheduling methods, and most of them revert to EASY-backfilling, also known as EASY-FCFS (EASY-First-Come-First-Served). Newer methods are frequently complex and obscure, and the simplicity and transparency of EASY are too important to sacrifice. In this work, we use execution logs from five HPC platforms to compare four simple scheduling policies: FCFS, Shortest estimated Processing time First (SPF), Smallest Requested Resources First (SQF), and Smallest estimated Area First (SAF). Using simulations, we perform a thorough analysis of the cumulative results for up to 180 weeks and consider three scheduling objectives: waiting time, slowdown, and per-processor slowdown. We also evaluate other effects for each HPC platform and scheduling policy, such as the relationship between job size and slowdown, the distribution of slowdown values, and the number of backfilled jobs. We conclude that one can only gain by replacing EASY-backfilling with SAF with backfilling, as it improves performance by up to 80% in the slowdown metric while maintaining the simplicity and transparency of FCFS. Moreover, SAF reduces the number of jobs with large slowdowns, and the inclusion of a simple thresholding mechanism guarantees that no starvation occurs. Finally, we propose SAF as a new benchmark for future scheduling studies.
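
The four policies compared above differ only in the key used to order the waiting queue, which can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Job` fields, the `order_queue` helper, and the `max_wait` parameter are hypothetical names, the estimated area is assumed to be the estimated processing time multiplied by the requested processors, and the thresholding rule (jobs that have waited longer than `max_wait` move to the front in submission order) is one plausible reading of the mechanism mentioned in the abstract. Backfilling itself is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; field names are assumptions for illustration.
    job_id: int
    submit_time: float   # submission time, in seconds
    est_runtime: float   # user-estimated processing time, in seconds
    req_procs: int       # number of requested processors


# Each policy is just a sort key over the waiting queue (smaller goes first).
POLICIES = {
    "FCFS": lambda j: j.submit_time,                # first come, first served
    "SPF": lambda j: j.est_runtime,                 # shortest estimated time
    "SQF": lambda j: j.req_procs,                   # smallest resource request
    "SAF": lambda j: j.est_runtime * j.req_procs,   # smallest estimated area
}


def order_queue(queue, policy, now=0.0, max_wait=None):
    """Return the waiting jobs in the order the policy would consider them.

    If max_wait is given, jobs that have waited longer than max_wait are
    placed first, in submission order (assumed anti-starvation threshold).
    """
    key = POLICIES[policy]

    def sort_key(job):
        starving = max_wait is not None and (now - job.submit_time) > max_wait
        return (0, job.submit_time) if starving else (1, key(job))

    return sorted(queue, key=sort_key)


jobs = [
    Job(1, submit_time=0, est_runtime=3600, req_procs=64),
    Job(2, submit_time=10, est_runtime=600, req_procs=128),
    Job(3, submit_time=20, est_runtime=7200, req_procs=4),
]

for name in POLICIES:
    print(name, [j.job_id for j in order_queue(jobs, name)])

# With a threshold, job 1 (waiting 100 s > 95 s) jumps ahead under SAF.
print([j.job_id for j in order_queue(jobs, "SAF", now=100, max_wait=95)])
```

Because only the sort key changes, swapping FCFS for SAF in an EASY-style scheduler leaves the rest of the logic untouched, which is the sense in which the simplicity and transparency of FCFS are preserved.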
