Improving backfilling by using machine learning to predict running times

The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever-increasing amount of data, they are subject to uncertainty in some parameters, such as job running times. The question raised in this work is: to what extent is it possible, and useful, to take predictions of job running times into account to improve global scheduling? We present a comprehensive study that answers this question under the popular EASY backfilling policy. More precisely, we rely on classical machine learning methods and propose new cost functions well adapted to the problem. We then assess the proposed solutions through intensive simulations using several production logs. Finally, we propose a new scheduling algorithm that outperforms the popular EASY backfilling algorithm by 28% with respect to the average bounded slowdown objective.
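
For readers unfamiliar with the objective named above, the following is a minimal sketch of how average bounded slowdown is commonly computed in the parallel job scheduling literature. The function names and the threshold `tau = 10` seconds are illustrative assumptions; the abstract does not state which threshold the study uses.

```python
from typing import Iterable, Tuple


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Bounded slowdown of one job.

    wait: time the job spent in the queue (seconds)
    run:  actual running time of the job (seconds)
    tau:  lower bound on the running time used in the denominator, so that
          very short jobs do not dominate the average (assumed value: 10 s)
    """
    return max((wait + run) / max(run, tau), 1.0)


def average_bounded_slowdown(
    jobs: Iterable[Tuple[float, float]], tau: float = 10.0
) -> float:
    """Mean bounded slowdown over (wait, run) pairs."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("at least one job is required")
    return sum(bounded_slowdown(w, r, tau) for w, r in jobs) / len(jobs)
```

A job that waits 90 s and runs 10 s has a bounded slowdown of 10, whereas a 1 s job that starts immediately is clamped to 1. A lower average therefore indicates that jobs, short ones in particular, wait less relative to their running time.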
