Predicting queue wait time probabilities for multi-scale computing

We describe a method for predicting queue wait times in supercomputing clusters. It was designed as part of a multi-criteria brokering mechanism for resource selection in a multi-site High Performance Computing environment, with the aim of incorporating the time jobs spend queued in the scheduling system into the selection criteria. End users can also use the method to estimate the time to completion of their computing jobs. Predictions are based on historical data about the particular system. The method returns a list of probability estimates of the form (ti, pi), where pi is the probability that the job will start before time ti. The times ti can be chosen more or less freely when deploying the system. Compared with regression methods, which return only a single number as a queue wait time estimate (usually without error bars), our prediction system provides more useful information. The probability estimates are calculated using Bayes' theorem under the naive assumption that the attributes describing the jobs are independent. They are then calibrated to make them as accurate as possible given the available data. We describe our service, its REST API and the underlying methods in detail, and provide empirical evidence in support of the method's efficacy. This article is part of the theme issue ‘Multiscale modelling, simulation and computing: from the desktop to the exascale’.
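
To make the form of the output concrete, the following is a minimal sketch of how a list of (ti, pi) pairs could be obtained from historical data with a naive Bayes estimate, one binary "started within ti" question per threshold. It is an illustration under stated assumptions, not the service's implementation: the function names `train` and `predict`, the attribute name `queue`, the toy history and the add-one (Laplace) smoothing are all hypothetical choices made for this example, and the calibration step mentioned above is omitted.

```python
from collections import Counter, defaultdict


def train(history, thresholds):
    """Count class and attribute frequencies for each threshold t.

    history: list of (attrs, wait_seconds), attrs being a dict of
    categorical job attributes (hypothetical representation).
    """
    # Distinct values seen per attribute, needed for add-one smoothing.
    values = defaultdict(set)
    for attrs, _ in history:
        for name, value in attrs.items():
            values[name].add(value)

    models = []
    for t in thresholds:
        class_counts = Counter()             # True: job started within t
        attr_counts = defaultdict(Counter)   # (class, attribute) -> value counts
        for attrs, wait in history:
            started = wait <= t
            class_counts[started] += 1
            for name, value in attrs.items():
                attr_counts[(started, name)][value] += 1
        models.append((t, class_counts, attr_counts))
    return models, values


def predict(models, values, attrs):
    """Return [(t, p)], p estimating P(job starts within t | attrs)."""
    estimates = []
    for t, class_counts, attr_counts in models:
        total = sum(class_counts.values())
        score = {}
        for started in (True, False):
            # Smoothed prior times a product of per-attribute likelihoods:
            # this product is the naive independence assumption.
            s = (class_counts[started] + 1) / (total + 2)
            for name, value in attrs.items():
                n_values = len(values[name] | {value})
                count = attr_counts[(started, name)][value]
                s *= (count + 1) / (class_counts[started] + n_values)
            score[started] = s
        # Normalise the two scores (Bayes' theorem).
        estimates.append((t, score[True] / (score[True] + score[False])))
    return estimates


# Toy history, invented purely for illustration.
history = [
    ({"queue": "short"}, 60),
    ({"queue": "short"}, 120),
    ({"queue": "long"}, 7200),
    ({"queue": "long"}, 10000),
]
models, values = train(history, thresholds=[300, 3600])
for t, p in predict(models, values, {"queue": "short"}):
    print(f"P(start within {t} s) ~ {p:.2f}")
```

Because each threshold is treated as a separate question in this sketch, the resulting pi are not forced to increase with ti; a deployed system would also need to handle continuous attributes and calibrate the raw estimates, as described above.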
