Scavenger: A Black-Box Batch Workload Resource Manager for Improving Utilization in Cloud Environments

Resource under-utilization is common in cloud data centers. Prior work has proposed improving utilization by running provider workloads in the background, colocated with tenant workloads. However, an important challenge remains unaddressed: treating the tenant workloads as a black box. We present Scavenger, a batch workload manager that opportunistically runs containerized batch jobs next to black-box tenant VMs to improve utilization. Scavenger is designed to work without any offline profiling or prior information about the tenant workload. To meet the tenant VMs' resource demands at all times, Scavenger dynamically regulates the batch jobs' resource usage, including processor usage, memory capacity, and network bandwidth. We experimentally evaluate Scavenger on two different testbeds, using latency-sensitive tenant workloads colocated with background Spark jobs, and show that Scavenger significantly increases resource usage without compromising the resource demands of tenant VMs.
