Online Scheduling with Redirection for Parallel Jobs

An important component of High Performance Computing (HPC) clusters is the job scheduling algorithm, which decides how jobs are allocated to resources and when they run. Such algorithms must scale to cope with the growth in both size and complexity of modern clusters. In this paper, we propose a new algorithm for scheduling parallel jobs with redirection. Specifically, our algorithm redirects jobs whose execution significantly affects a large number of other jobs. A redirected job is stopped and restarted from the beginning in a dedicated part of the cluster. We show the effectiveness of our method through an extensive campaign of simulations based on log traces from production clusters.
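
The redirection mechanism described above can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: the abstract does not state how the impact of a job on other jobs is measured, so the rule used here (a running job is redirected once it delays at least `min_blocked` waiting jobs) is an assumption, as are the names `Job`, `should_redirect`, `redirect`, `apply_redirection` and the partition labels `"main"` and `"dedicated"`. The sketch only shows the two properties stated in the abstract: a redirected job loses its progress, and it restarts in a dedicated part of the cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    runtime: float                 # total processing time required
    progress: float = 0.0          # work completed so far
    partition: str = "main"        # hypothetical partition label
    # ids of waiting jobs this job currently delays (hypothetical impact measure)
    blocked: set = field(default_factory=set)


def should_redirect(job: Job, min_blocked: int) -> bool:
    """Assumed rule: redirect a job on the main partition that delays
    at least `min_blocked` waiting jobs."""
    return job.partition == "main" and len(job.blocked) >= min_blocked


def redirect(job: Job) -> Job:
    """Stop the job and restart it from the beginning on the dedicated part."""
    job.progress = 0.0             # all completed work is discarded
    job.partition = "dedicated"
    job.blocked.clear()            # it no longer delays jobs on the main part
    return job


def apply_redirection(running: list, min_blocked: int = 3) -> list:
    """Redirect every qualifying running job; return the ids that were redirected."""
    redirected = []
    for job in running:
        if should_redirect(job, min_blocked):
            redirect(job)
            redirected.append(job.job_id)
    return redirected
```

For instance, with `min_blocked=3`, a job that delays three waiting jobs is moved to the dedicated partition with its progress reset to zero, while a job that delays a single waiting job keeps running where it is.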
