Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms

Cloud research to date has lacked data on the characteristics of the production virtual machine (VM) workloads of large cloud providers. A thorough understanding of these characteristics can inform the providers' resource management systems, e.g. VM scheduler, power manager, server health manager. In this paper, we first introduce an extensive characterization of Microsoft Azure's VM workload, including distributions of the VMs' lifetime, deployment size, and resource consumption. We then show that certain VM behaviors are fairly consistent over multiple lifetimes, i.e. history is an accurate predictor of future behavior. Based on this observation, we next introduce Resource Central (RC), a system that collects VM telemetry, learns these behaviors offline, and provides predictions online to various resource managers via a general client-side library. As an example of RC's online use, we modify Azure's VM scheduler to leverage predictions in oversubscribing servers (with oversubscribable VM types), while retaining high VM performance. Using real VM traces, we then show that the prediction-informed schedules increase utilization and prevent physical resource exhaustion. We conclude that providers can exploit their workloads' characteristics and machine learning to improve resource management substantially.

[1]  Xifeng Yan,et al.  Workload characterization and prediction in the cloud: A multiple time series approach , 2012, 2012 IEEE Network Operations and Management Symposium.

[2]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[3]  Randy H. Katz,et al.  Heterogeneity and dynamicity of clouds at scale: Google trace analysis , 2012, SoCC '12.

[4]  Akshat Verma,et al.  pMapper: Power and Migration Cost Aware Application Placement in Virtualized Systems , 2008, Middleware.

[5]  Ricardo Bianchini,et al.  DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments , 2013, USENIX Annual Technical Conference.

[6]  Andrzej Kochut,et al.  Dynamic Placement of Virtual Machines for Managing SLA Violations , 2007, 2007 10th IFIP/IEEE International Symposium on Integrated Network Management.

[7]  Rajkumar Buyya,et al.  Energy Efficient Resource Management in Virtualized Cloud Data Centers , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[8]  Aniruddha S. Gokhale,et al.  Efficient Autoscaling in the Cloud Using Predictive Models for Workload Forecasting , 2011, 2011 IEEE 4th International Conference on Cloud Computing.

[9]  Bin Li,et al.  Dynamo: Facebook's Data Center-Wide Power Management System , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[10]  Christoforos E. Kozyrakis,et al.  Heracles: Improving resource efficiency at scale , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[11]  Chita R. Das,et al.  Towards characterizing cloud backend workloads: insights from Google compute clusters , 2010, PERV.

[12]  Xiaowei Yang,et al.  CloudCmp: comparing public cloud providers , 2010, IMC '10.

[13]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[14]  Xin Wang,et al.  Clipper: A Low-Latency Online Prediction Serving System , 2016, NSDI.

[15]  Sheng Di,et al.  Characterization and Comparison of Cloud versus Grid Workloads , 2012, 2012 IEEE International Conference on Cluster Computing.

[16]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[17]  R. Preston McAfee,et al.  Usage Patterns and the Economics of the Public Cloud , 2017, WWW.

[18]  Rajkumar Buyya,et al.  Workload Prediction Using ARIMA Model and Its Impact on Cloud Applications’ QoS , 2015, IEEE Transactions on Cloud Computing.

[19]  Zhenhuan Gong,et al.  PRESS: PRedictive Elastic ReSource Scaling for cloud systems , 2010, 2010 International Conference on Network and Service Management.

[20]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[21]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[22]  Kevin Lee,et al.  Empirical prediction models for adaptive resource provisioning in the cloud , 2012, Future Gener. Comput. Syst..

[23]  Kang G. Shin,et al.  Automated control of multiple virtualized resources , 2009, EuroSys '09.

[24]  Keqiang He,et al.  Next stop, the cloud: understanding modern web service deployment in EC2 and azure , 2013, Internet Measurement Conference.

[25]  Michael I. Jordan,et al.  The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox , 2014, CIDR.