Dynamic Resource Shaping for Compute Clusters

Data centers are largely under-utilized today because resource allocation relies on reservation mechanisms that ignore actual resource usage: it is common to reserve resources for peak demand, which may occur only for a small fraction of an application's lifetime, so cluster resources often sit idle. In this work, we propose a mechanism that improves compute cluster utilization and responsiveness while preventing application failures due to contention on finite resources such as RAM. Our method monitors resource utilization and employs a data-driven approach to resource demand forecasting that quantifies the uncertainty of its predictions. Using the demand forecast and its confidence, our mechanism modulates the cluster resources assigned to running applications, reducing turnaround time by more than one order of magnitude while keeping application failures under control. Thus, tenants enjoy a responsive system and providers benefit from efficient cluster utilization.
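To make the core idea concrete, here is a minimal sketch of demand forecasting with uncertainty and allocation shaping. The Gaussian-process forecaster, the safety factor, the node memory limit, and all numbers are illustrative assumptions, not the authors' implementation; any uncertainty-aware forecaster could play the same role.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical recent memory-usage samples (GiB) for one running application.
history = np.array([4.1, 4.3, 4.8, 5.6, 6.0, 6.2, 6.1, 6.4])
t = np.arange(len(history)).reshape(-1, 1)

# Fit an uncertainty-aware forecaster: the predictive standard deviation
# determines how conservative the shaped allocation is.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(t, history)

# Forecast demand a few steps ahead, with predictive uncertainty.
horizon = np.arange(len(history), len(history) + 3).reshape(-1, 1)
mean, std = gp.predict(horizon, return_std=True)

# Shape the allocation: grant the forecast peak plus a margin scaled by the
# forecast uncertainty, clipped to the node's physical limit.
SAFETY_FACTOR = 2.0      # hypothetical tuning knob
NODE_MEMORY_GIB = 64.0   # hypothetical node capacity
allocation = min(float(np.max(mean + SAFETY_FACTOR * std)), NODE_MEMORY_GIB)
print(f"next allocation: {allocation:.1f} GiB")
```

With low forecast uncertainty the allocation tracks predicted demand closely, reclaiming slack for other tenants; with high uncertainty the margin grows, which is what keeps contention-induced failures under control.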
