Trends in On-chip Dynamic Resource Management

The complexity of emerging multi/many-core architectures and the diversity of modern workloads demand coordinated dynamic resource management methods. We introduce a classification of these methods that captures the resources they use and the metrics they target, and we use this classification to survey the key efforts in dynamic resource management. We first cover heuristic and optimization methods used to manage power, energy, temperature, Quality-of-Service (QoS), and system reliability. We then identify some of the machine-learning-based methods used to tune architectural parameters in computer systems. In many cases, resource managers must enforce design constraints at runtime with a certain level of guarantee. Hence, we also study the trend toward deploying formal control-theoretic approaches to achieve efficient and robust dynamic resource management.
