A Methodology for Online Consolidation of Tasks through More Accurate Resource Estimations

Cloud providers offer computing services for a wide range of applications, such as web applications, email, web search, and MapReduce jobs. These applications are commonly scheduled on multi-purpose clusters that are becoming larger and more heterogeneous. A major challenge is to use the cluster's available resources efficiently, in particular to maximize machine utilization while minimizing the applications' waiting time. We studied a publicly available trace from a large Google cluster (approximately 12,000 machines) and observed that users generally request more resources than their tasks require, which leads to low utilization. In this paper, we propose a methodology for using the cluster's resources efficiently while providing users with fast and reliable computing services. The methodology consists of three main modules: i) a prediction module that forecasts the maximum resource requirement of a task, ii) a scalable scheduling module that efficiently allocates tasks to machines, and iii) a monitoring module that tracks the utilization levels of machines and tasks. Our results show that scheduling tasks with more accurate resource estimations can increase the average utilization of the cluster, reduce the number of evicted tasks, and reduce the tasks' waiting time.
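
The effect of over-requested resources on packing can be illustrated with a minimal sketch. It is not the scheduler proposed in the paper: it uses a plain first-fit placement as a stand-in, and the names `Task`, `Machine`, `first_fit`, and `run`, as well as all numbers, are hypothetical and chosen only for illustration. Each task requests 0.5 of a machine's CPU, while its assumed predicted peak usage is 0.25.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    requested: float   # CPU share the user asked for (illustrative value)
    predicted: float   # assumed forecast of the task's peak CPU usage


@dataclass
class Machine:
    capacity: float
    reserved: float = 0.0
    tasks: list = field(default_factory=list)


def first_fit(tasks, machines, size_of):
    """Place each task on the first machine with enough free capacity.

    `size_of` decides how much capacity a task reserves.
    Returns the tasks that could not be placed and would have to wait.
    """
    waiting = []
    for task in tasks:
        need = size_of(task)
        for machine in machines:
            if machine.reserved + need <= machine.capacity:
                machine.reserved += need
                machine.tasks.append(task)
                break
        else:
            # No machine had room for this task.
            waiting.append(task)
    return waiting


def run(size_of):
    """Schedule six identical tasks on two machines; return (placed, waiting)."""
    tasks = [Task(f"t{i}", requested=0.5, predicted=0.25) for i in range(6)]
    machines = [Machine(capacity=1.0) for _ in range(2)]
    waiting = first_fit(tasks, machines, size_of)
    return len(tasks) - len(waiting), len(waiting)


# Reserve capacity by the user's request versus by the predicted peak.
placed_by_request = run(lambda t: t.requested)
placed_by_prediction = run(lambda t: t.predicted)
print(placed_by_request, placed_by_prediction)
```

In this toy setting, reserving by request leaves two tasks waiting even though the machines would be half idle, whereas reserving by the predicted peak places all six. A real system would also need the monitoring step described above to react when a prediction turns out to be too low.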
