CuttleSys: Data-Driven Resource Management forInteractive Applications on Reconfigurable Multicores

Multi-tenancy for latency-critical applications leads to re-source interference and unpredictable performance. Core reconfiguration opens up more opportunities for colocation,as it allows the hardware to adjust to the dynamic performance and power needs of a specific mix of co-scheduled applications. However, reconfigurability also introduces challenges, as even for a small number of reconfigurable cores, exploring the design space becomes more time- and resource-demanding. We present CuttleSys, a runtime for reconfigurable multi-cores that leverages scalable and lightweight data mining to quickly identify suitable core and cache configurations for a set of co-scheduled applications. The runtime combines collaborative filtering to infer the behavior of each job on every core and cache configuration, with Dynamically Dimensioned Search to efficiently explore the configuration space. We evaluate CuttleSys on multicores with tens of reconfigurable cores and show up to 2.46x and 1.55x performance improvements compared to core-level gating and oracle-like asymmetric multicores respectively, under stringent power constraints.

[1]  Furat Afram,et al.  FlexCore: A Reconfigurable Processor Supporting Flexible, Dynamic Morphing , 2015, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC).

[2]  Lizy Kurian John,et al.  Predictive coordination of multiple on-chip resources for chip multiprocessors , 2011, ICS '11.

[3]  Dheeraj Reddy,et al.  Bias scheduling in heterogeneous multi-core architectures , 2010, EuroSys '10.

[4]  Hans-Martin Gutmann,et al.  A Radial Basis Function Method for Global Optimization , 2001, J. Glob. Optim..

[5]  Kai Ma,et al.  PGCapping: Exploiting power gating for power capping and core lifetime balancing in CMPs , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[6]  Thomas F. Wenisch,et al.  PowerNap: eliminating server idle power , 2009, ASPLOS.

[7]  Nam Sung Kim,et al.  RCS: Runtime resource and core scaling for power-constrained multi-core processors , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[8]  Israel Koren,et al.  An opportunistic prediction-based thread scheduling to maximize throughput/watt in AMPs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[9]  Daniel Mossé,et al.  Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[10]  Amin Ansari,et al.  Using Multiple Input, Multiple Output Formal Control to Maximize Resource Efficiency in Architectures , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[11]  Engin Ipek,et al.  Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.

[12]  Rajesh Kumar,et al.  A family of 45nm IA processors , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[13]  Thomas F. Wenisch,et al.  Power management of online data-intensive services , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[14]  Shaolei Ren,et al.  Exploiting Processor Heterogeneity in Interactive Services , 2013, ICAC.

[15]  Christoforos E. Kozyrakis,et al.  Heracles: Improving resource efficiency at scale , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[16]  Shaolei Ren,et al.  A Theoretical Foundation for Scheduling and Designing Heterogeneous Processors for Interactive Applications , 2014, DISC.

[17]  Thomas F. Wenisch,et al.  DreamWeaver: architectural support for deep sleep , 2012, ASPLOS XVII.

[18]  Gero Dittmann,et al.  Exploring power management in multi-core systems , 2008, 2008 Asia and South Pacific Design Automation Conference.

[19]  Christina Delimitrou,et al.  Quality-of-Service-Aware Scheduling in Heterogeneous Data centers with Paragon , 2014, IEEE Micro.

[20]  Robert M. Bell,et al.  The BellKor 2008 Solution to the Netflix Prize , 2008 .

[21]  Randy H. Katz,et al.  Heterogeneity and dynamicity of clouds at scale: Google trace analysis , 2012, SoCC '12.

[22]  Gu-Yeon Wei,et al.  Tradeoffs between power management and tail latency in warehouse-scale applications , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).

[23]  Christine A. Shoemaker,et al.  Scalable thread scheduling and global power management for heterogeneous many-core architectures , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[24]  Stijn Eyerman,et al.  Maximizing Heterogeneous Processor Performance Under Power Constraints , 2016, ACM Trans. Archit. Code Optim..

[25]  Christine A. Shoemaker,et al.  SO-MI: A surrogate model algorithm for computationally expensive nonlinear mixed-integer black-box global optimization problems , 2013, Comput. Oper. Res..

[26]  Lizy Kurian John,et al.  Efficient program scheduling for heterogeneous multi-core processors , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[27]  Margaret Martonosi,et al.  An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[28]  Eric Rotenberg,et al.  AnyCore: A synthesizable RTL model for exploring and fabricating adaptive superscalar cores , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[29]  Dimitris Gizopoulos,et al.  Adaptive Voltage/Frequency Scaling and Core Allocation for Balanced Energy and Performance on Multicore CPUs , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[30]  Kai Ma,et al.  Scalable power control for many-core architectures running multi-threaded applications , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[31]  Henry Hoffmann,et al.  CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[32]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[33]  Christina Delimitrou,et al.  PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services , 2019, ASPLOS.

[34]  Kunle Olukotun,et al.  Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms , 2015, NIPS.

[35]  Eric Rotenberg,et al.  A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[36]  Xiaohui Gu,et al.  AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service , 2013, ICAC.

[37]  Christine A. Shoemaker,et al.  Flicker: a dynamically adaptive architecture for power limited multicore systems , 2013, ISCA.

[38]  Daniel Sánchez,et al.  Tailbench: a benchmark suite and evaluation methodology for latency-critical applications , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[39]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[40]  Daniel Sánchez,et al.  Ubik: efficient cache sharing with strict qos for latency-critical workloads , 2014, ASPLOS.

[41]  Diana Marculescu,et al.  Dynamic thread mapping for high-performance, power-efficient heterogeneous many-core systems , 2013, 2013 IEEE 31st International Conference on Computer Design (ICCD).

[42]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[43]  Lieven Eeckhout,et al.  Scheduling heterogeneous multi-cores through performance impact estimation (PIE) , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[44]  Christoforos E. Kozyrakis,et al.  Towards energy proportionality for large-scale latency-critical workloads , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[45]  Yale N. Patt,et al.  MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[46]  Lingjia Tang,et al.  SmoothOperator: Reducing Power Fragmentation and Improving Power Utilization in Large-scale Datacenters , 2018, ASPLOS.

[47]  Quan Chen,et al.  Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers , 2017, ASPLOS.

[48]  Stijn Eyerman,et al.  Mind The Power Holes: Sifting Operating Points in Power-Limited Heterogeneous Multicores , 2017, IEEE Computer Architecture Letters.

[49]  Josep Torrellas,et al.  Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors , 2008, 2008 International Symposium on Computer Architecture.

[50]  Christina Delimitrou,et al.  Bolt: I Know What You Did Last Summer... In The Cloud , 2017, ASPLOS.

[51]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[52]  Axel Jantsch,et al.  SPECTR: Formal Supervisory Control and Coordination for Many-core Systems Resource Management , 2018, ASPLOS.

[53]  Christina Delimitrou,et al.  Tarcil: reconciling scheduling speed and quality in large shared clusters , 2015, SoCC.

[54]  Lizy Kurian John,et al.  Predictive Heterogeneity-Aware Application Scheduling for Chip Multiprocessors , 2014, IEEE Transactions on Computers.

[55]  Christina Delimitrou,et al.  HCloud: Resource-Efficient Provisioning in Shared Cloud Systems , 2016, ASPLOS.

[56]  C. Shoemaker,et al.  Combining radial basis function surrogates and dynamic coordinate search in high-dimensional expensive black-box optimization , 2013 .

[57]  Indrani Paul,et al.  Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[58]  Luca Benini,et al.  Thermal and Energy Management of High-Performance Multicores: Distributed and Self-Calibrating Model-Predictive Controller , 2013, IEEE Transactions on Parallel and Distributed Systems.

[59]  Scott A. Mahlke,et al.  Composite Cores: Pushing Heterogeneity Into a Core , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[60]  Christina Delimitrou,et al.  Mage: online and interference-aware scheduling for multi-scale heterogeneous systems , 2018, PACT.

[61]  Christina Delimitrou,et al.  QoS-Aware scheduling in heterogeneous datacenters with paragon , 2013, TOCS.

[62]  Hong Wang,et al.  Post-Silicon CPU Adaptation Made Practical Using Machine Learning , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[63]  Lingjia Tang,et al.  Whare-map: heterogeneity in "homogeneous" warehouse-scale computers , 2013, ISCA.

[64]  Christina Delimitrou,et al.  Paragon: QoS-aware scheduling for heterogeneous datacenters , 2013, ASPLOS '13.

[65]  Scott A. Mahlke,et al.  Trace based phase prediction for tightly-coupled heterogeneous cores , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[66]  Henry Hoffmann,et al.  Maximizing Performance Under a Power Cap: A Comparison of Hardware, Software, and Hybrid Techniques , 2016, ASPLOS.

[67]  Patrick Crowley,et al.  Dynamic thread assignment on heterogeneous multiprocessor architectures , 2006, CF '06.

[68]  Kai Ma,et al.  Temperature-constrained power control for chip multiprocessors with online model estimation , 2009, ISCA '09.

[69]  David Wentzlaff,et al.  The sharing architecture: sub-core configurability for IaaS clouds , 2014, ASPLOS 2014.

[70]  Ronald G. Dreslinski,et al.  Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[71]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[72]  Lieven Eeckhout,et al.  Chrysso: an integrated power manager for constrained many-core processors , 2015, Conf. Computing Frontiers.

[73]  Christine A. Shoemaker,et al.  Local function approximation in evolutionary algorithms for the optimization of costly functions , 2004, IEEE Transactions on Evolutionary Computation.

[74]  Wei Zhang,et al.  Dynamic core scaling: Trading off performance and energy beyond DVFS , 2015, 2015 33rd IEEE International Conference on Computer Design (ICCD).

[75]  C. F. Jeff Wu,et al.  Experiments , 2021, Wiley Series in Probability and Statistics.

[76]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[77]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[78]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[79]  Christina Delimitrou,et al.  Pliant: Leveraging Approximation to Improve Datacenter Resource Efficiency , 2018, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[80]  David A. Patterson,et al.  A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness , 2013, ISCA.

[81]  Pradip Bose,et al.  Evaluating design tradeoffs in on-chip power management for CMPs , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[82]  Christoforos E. Kozyrakis,et al.  Vantage: Scalable and efficient fine-grain cache partitioning , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[83]  Daniel Sánchez,et al.  Rubik: Fast analytical power management for latency-critical systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[84]  Vanish Talwar,et al.  Power Management of Datacenter Workloads Using Per-Core Power Gating , 2009, IEEE Computer Architecture Letters.

[85]  Robin D. Burke,et al.  Hybrid Recommender Systems: Survey and Experiments , 2002, User Modeling and User-Adapted Interaction.

[86]  Christopher Meek,et al.  A unified approach to building hybrid recommender systems , 2009, RecSys '09.

[87]  Lieven Eeckhout,et al.  Fairness-aware scheduling on single-ISA heterogeneous multi-cores , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[88]  Bryan A. Tolson,et al.  Dynamically dimensioned search algorithm for computationally efficient watershed model calibration , 2007 .

[89]  Krzysztof C. Kiwiel,et al.  Convergence and efficiency of subgradient methods for quasiconvex minimization , 2001, Math. Program..

[90]  Christoforos E. Kozyrakis,et al.  ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.

[91]  Rajiv Nishtala,et al.  Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services , 2020, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[92]  Stacey Jeffery,et al.  HASS: a scheduler for heterogeneous multicore systems , 2009, OPSR.

[93]  Christine A. Shoemaker,et al.  A Stochastic Radial Basis Function Method for the Global Optimization of Expensive Functions , 2007, INFORMS J. Comput..

[94]  Juan Carlos Saez,et al.  ACFS: a completely fair scheduler for asymmetric single-isa multicore systems , 2015, SAC.