Avalon: towards QoS awareness and improved utilization through multi-resource management in datacenters

Existing techniques for improving datacenter utilization while guaranteeing QoS assume that queries have similar behaviors. However, user queries in emerging compute-demanding services exhibit significantly diverse behaviors and require adaptive parallelism. Our study shows that the end-to-end latency of a compute-demanding query is determined jointly by the system-wide load, the query's own workload, its parallelism, and contention on the shared cache and memory bandwidth. When hosting such services, current cross-query resource allocation results in either severe QoS violations or significant resource under-utilization. To maximize hardware utilization while guaranteeing QoS, we present Avalon, a runtime system that allocates shared resources for each query independently. Avalon first provides an automatic feature identification tool based on Lasso regression to identify the features relevant to a query's performance. It then builds models that precisely predict a query's duration under various resource configurations. Based on these prediction models, Avalon proactively allocates "just-enough" cores and shared cache space to each query, so that the remaining resources can be assigned to best-effort applications. At runtime, Avalon monitors the progress of each query and mitigates possible QoS violations caused by memory bandwidth contention, occasional I/O contention, or unpredictable system interference. Our results show that Avalon improves utilization by 28.9% on average compared with state-of-the-art techniques while meeting the 99th-percentile latency target.
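To make the two modeling steps concrete, the following minimal sketch (illustrative only, not Avalon's actual implementation) shows how Lasso regression can select performance-relevant features from profiled query samples and how the surviving features can drive a simple duration predictor used to pick a "just-enough" resource configuration; the feature names, synthetic data, and scikit-learn usage are assumptions for illustration.

# Illustrative sketch: Lasso-based feature selection followed by a duration
# predictor, in the spirit of Avalon's modeling steps. Feature names and
# data below are hypothetical; this is not Avalon's code.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical profiled samples: one row per query execution.
# Columns: [input_size, allocated_cores, cache_ways, system_load, bw_usage]
X = np.random.rand(200, 5)
# Observed end-to-end latency (ms) for each sample (synthetic here).
y = 50 + 30 * X[:, 0] - 20 * X[:, 1] + 5 * np.random.rand(200)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 1: Lasso drives the coefficients of irrelevant features toward zero,
# so the non-zero coefficients identify the performance-relevant features.
lasso = Lasso(alpha=0.05).fit(X_std, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-3)
print("selected feature indices:", selected)

# Step 2: fit a duration model on the selected features only.
predictor = LinearRegression().fit(X_std[:, selected], y)

# Predict the duration of a new query under a candidate resource
# configuration (cores, cache ways) to judge a "just-enough" allocation.
x_new = scaler.transform(np.array([[0.4, 0.5, 0.3, 0.6, 0.2]]))
print("predicted latency (ms):", predictor.predict(x_new[:, selected])[0])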
