Exploiting per-user information for supercomputing workload prediction requires care

Efficient management of supercomputing facilities requires estimates of future workload based on past user behaviour. For supercomputers with large numbers of users, aggregate user behaviour is commonly assumed to be the best predictor of future workload. For systems with fewer users, however, the question arises whether aggregation is still suitable or whether monitoring individual user behaviour yields better predictions. We compare three approaches: using individual user behaviour, using aggregate user behaviour, and a hybrid approach in which heavy users are tracked individually and light users are grouped into a small number of clusters. We find that the hybrid approach produces the best results in both mean absolute error and mean squared error. Treating all users separately, by contrast, provides slightly worse predictions. We also introduce a new approach to prediction based on the hazard function, which is a significant improvement on previously used schemes based on autoregressive models. The schemes are investigated numerically using a two-year workload trace from a supercomputer with a population of 136 users.
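As a rough illustration of the hybrid grouping and the two error measures named above, the following Python sketch may help. It is not the authors' implementation. The function names, the `heavy_share` threshold, and the round-robin assignment of light users to clusters are all assumptions made for this example, since the abstract does not say how heavy users are identified or how light users are clustered.

```python
def hybrid_groups(usage_by_user, heavy_share=0.8, n_light_clusters=2):
    """Group per-user usage series into heavy users and light-user clusters.

    usage_by_user maps a user name to a list of per-interval usage values
    (all lists are assumed to have the same length).
    heavy_share and n_light_clusters are hypothetical parameters: users are
    ranked by total usage and kept as individuals until they jointly cover
    heavy_share of the total; the rest are merged into clusters.
    """
    totals = {user: sum(series) for user, series in usage_by_user.items()}
    grand_total = sum(totals.values())
    ranked = sorted(totals, key=totals.get, reverse=True)

    heavy, covered = [], 0.0
    for user in ranked:
        if covered >= heavy_share * grand_total:
            break
        heavy.append(user)
        covered += totals[user]
    light = ranked[len(heavy):]

    # Heavy users keep their own series.
    groups = {f"user:{user}": list(usage_by_user[user]) for user in heavy}

    # Light users are assigned round-robin (an assumption, not the paper's
    # clustering method) and their series are summed per cluster.
    for i in range(n_light_clusters):
        members = light[i::n_light_clusters]
        if members:
            series = [usage_by_user[user] for user in members]
            groups[f"light:{i}"] = [sum(values) for values in zip(*series)]
    return groups


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all intervals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all intervals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Under these assumptions, each resulting group would get its own predictor, and the per-group forecasts would be summed to give the total workload forecast, which is then scored with the two error functions.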
