RCS: Runtime resource and core scaling for power-constrained multi-core processors

Providing a sufficient voltage/frequency (V/F) scaling range is critical for effective power management. However, it has been fraught with decreasing nominal operating voltage and increasing manufacturing process variability that makes it harder to scale the minimum operating voltage (VMIN). In this paper, we first present a resource and core scaling (RCS) technique that jointly scales (i) the resources of a processor and (ii) the number of operating cores to maximize the performance of power-constrained multi-core processors. More specifically, we uniformly scale the resources that are both associated with each core (e.g., L1 caches and execution units (EUs)) and shared by all the cores (e.g., last-level cache (LLC)) as a means to compensate for lack of a V/F scaling range. Under the maximum power constraint, disabling some resources allows us to increase the number of operating cores, and vice versa. We demonstrate that the best RCS configuration for a given application can improve the geometric-mean performance by 21%. Second, we propose a runtime system that predicts the best RCS configuration for a given application and adapts the processor configuration accordingly at runtime. The runtime system only needs to examine a small fraction of runtime to predict the best RCS configuration with accuracy well over 90%, whereas the runtime overhead of prediction and adaptation is small. Finally, we propose to selectively scale the resources in RCS (dubbed sRCS) depending on application's characteristics and demonstrate that sRCS can offer 6% higher geometric-mean performance than RCS that uniformly scales the resources.

[1]  Jian Li,et al.  Dynamic power-performance adaptation of parallel computation on chip multiprocessors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[2]  Maurice Steinman,et al.  AMD'S "LLANO" Fusion APU , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[3]  Kai Ma,et al.  Adaptive Power Control with Online Model Estimation for Chip Multiprocessors , 2011, IEEE Transactions on Parallel and Distributed Systems.

[4]  Min Xu,et al.  Evaluating Non-deterministic Multi-threaded Commercial Workloads , 2001 .

[5]  Milos D. Ercegovac,et al.  The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[6]  Brad Calder,et al.  Transition phase classification and prediction , 2005, 11th International Symposium on High-Performance Computer Architecture.

[7]  Bruce R. Childers,et al.  Using utility prediction models to dynamically choose program thread counts , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[8]  Christian Bienia,et al.  PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors , 2009 .

[9]  Kai Ma,et al.  Scalable power control for many-core architectures running multi-threaded applications , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[10]  Christine A. Shoemaker,et al.  Flicker: a dynamically adaptive architecture for power limited multicore systems , 2013, ISCA.

[11]  Yale N. Patt,et al.  MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[12]  S. Borkar,et al.  Dynamic-sleep transistor and body bias for active leakage power control of microprocessors , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC..

[13]  George Varghese,et al.  A 22nm IA multi-CPU and GPU System-on-Chip , 2012, 2012 IEEE International Solid-State Circuits Conference.

[14]  Nam Sung Kim,et al.  Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[15]  Dean M. Tullsen,et al.  Reducing peak power with a table-driven adaptive processor core , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[16]  Engin Ipek,et al.  Dynamic Multicore Resource Management: A Machine Learning Approach , 2009, IEEE Micro.

[17]  T. N. Vijaykumar,et al.  Reducing Design Complexity of the Load/Store Queue , 2003, MICRO.

[18]  Ying Qian,et al.  Performance characteristics of openMP constructs, and application benchmarks on a large symmetric multiprocessor , 2003, ICS '03.

[19]  Shih-Chieh Chang,et al.  An Efficient Wake-Up Strategy Considering Spurious Glitches Phenomenon for Power Gating Designs , 2010, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[20]  Dean M. Tullsen,et al.  Fast switching of threads between cores , 2009, OPSR.

[21]  Engin Ipek,et al.  Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.

[22]  Yale N. Patt,et al.  Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs , 2008, ASPLOS.

[23]  Houman Homayoun,et al.  Dynamically heterogeneous cores through 3D resource pooling , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[24]  Sherief Reda,et al.  Pack & Cap: Adaptive DVFS and thread packing under power caps , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  S. Winkel Optimal versus Heuristic Global Code Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[26]  Li Shang,et al.  Multi-Optimization power management for chip multiprocessors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[27]  Engin Ipek,et al.  Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[28]  Nam Sung Kim,et al.  Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[29]  David M. Brooks,et al.  Efficiency trends and limits from comprehensive microarchitectural adaptivity , 2008, ASPLOS.

[30]  Margaret Martonosi,et al.  Long-term workload phases: duration predictions and applications to DVFS , 2005, IEEE Micro.

[31]  David A. Wood,et al.  WiDGET: Wisconsin decoupled grid execution tiles , 2010, ISCA.

[32]  Simon W. Moore,et al.  A communication characterisation of Splash-2 and Parsec , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[33]  Margaret Martonosi,et al.  An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[34]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[35]  Brad Calder,et al.  Using SimPoint for accurate and efficient simulation , 2003, SIGMETRICS '03.

[36]  Michael C. Huang,et al.  Dynamically Tuning Processor Resources with Adaptive Processing , 2003, Computer.

[37]  Alon Naveh,et al.  Power and Thermal Management in the Intel Core Duo Processor , 2006 .

[38]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[39]  Diana Marculescu,et al.  Microarchitecture-level power management , 2002, IEEE Trans. Very Large Scale Integr. Syst..