Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems

Latency-critical applications suffer from both average performance degradation and reduced completion-time predictability when collocated with batch tasks. Such variation forces the system to overprovision resources to ensure Quality of Service (QoS) for latency-critical tasks, degrading overall system throughput. We explore the causes of this variation and exploit the opportunity to mitigate it directly, improving both QoS and utilization simultaneously. We develop, implement, and evaluate Dirigent, a lightweight performance-management runtime system that accurately controls the QoS of latency-critical applications at fine time scales by leveraging existing architecture mechanisms. We evaluate Dirigent on a real machine and show that it is significantly more effective than configurations representative of prior schemes.
