An Analytical Framework for Estimating Scale-Out and Scale-Up Power Efficiency of Heterogeneous Manycores

Heterogeneous manycore architectures have been shown to be highly promising for boosting power efficiency in two independent ways: (1) enabling massive thread-level parallelism, called the "scale-out" approach, and (2) enabling thread migration between heterogeneous cores, called the "scale-up" approach. Accurately modeling the power-efficiency gains of these two approaches is essential to realizing the power efficiency of such architectures, and the modeling should ideally be analytical and computationally efficient. We propose a comprehensive analytical model that predicts the power efficiency achieved through both approaches. Since power efficiency is measured as performance per watt, the model is composed of a performance model and a power model. The performance model is built from two orthogonal functions, α and β. Function α describes the scale-out speedup from multithreading, and function β describes the scale-up speedup from core heterogeneity. Together, they let the performance model capture the overall speedup of any combination of multithreading and thread-to-core mapping strategies. The power model predicts the power of the corresponding scale-out and scale-up configurations, simultaneously capturing the power variations caused by thread synchronization and by thread migration between heterogeneous cores. We build both the performance and power models analytically and with computational complexity in mind, which yields a suite of comprehensive, low-complexity models suitable for runtime management. The models are validated on a large-scale heterogeneous manycore architecture using full-system simulation. For performance prediction, the average error is below 12 percent, lower than that of state-of-the-art methods. For power prediction, the average error is 7.74 percent. On top of the models, we introduce two heuristic scheduling algorithms, the performance-oriented MAX-P and the power-efficiency-oriented MAX-E, to demonstrate how the models can be used. The results show that MAX-P outperforms state-of-the-art methods by 18 percent in performance on average, and MAX-E outperforms the baseline by 70 percent in power efficiency on average.
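To make the model's structure more concrete, the Python sketch below combines a scale-out function α and a scale-up function β into an overall speedup. It then divides that speedup by predicted power to obtain performance per watt. Every specific choice here is a hypothetical assumption, because the abstract does not give the forms of α, β, or the power model. In particular, the sketch assumes the following:

- α follows Amdahl's law.
- β is the ratio of the mapping's aggregate throughput to that of the same number of little cores.
- α and β compose multiplicatively.
- Power is a simple per-core sum that ignores synchronization and migration effects.

All parameter values are made up. The exhaustive search only contrasts the two objectives (maximum performance versus maximum performance per watt). It is not the MAX-P or MAX-E heuristic.

```python
# Sketch of a performance-per-watt model built from two orthogonal speedup
# functions. All functional forms and parameter values are hypothetical.

PARALLEL_FRACTION = 0.9     # hypothetical parallelizable fraction (Amdahl's f)
BIG_SPEEDUP = 2.0           # hypothetical: a big core runs a thread 2x faster
P_BIG, P_LITTLE = 4.0, 1.0  # hypothetical per-core power in watts
P_STATIC = 2.0              # hypothetical chip-level static power in watts
N_BIG, N_LITTLE = 4, 12     # hypothetical core counts


def alpha(n: int) -> float:
    """Scale-out speedup of n threads, assumed to follow Amdahl's law."""
    f = PARALLEL_FRACTION
    return 1.0 / ((1.0 - f) + f / n)


def beta(n: int, k: int) -> float:
    """Scale-up speedup when k of the n threads run on big cores.

    Assumed: aggregate throughput of the mapping divided by the
    throughput of n little cores.
    """
    return (k * BIG_SPEEDUP + (n - k)) / n


def power(n: int, k: int) -> float:
    """Assumed power: k big and n - k little cores active, plus static power.

    Synchronization and migration effects are ignored in this sketch.
    """
    return k * P_BIG + (n - k) * P_LITTLE + P_STATIC


def speedup(n: int, k: int) -> float:
    """Overall speedup, assuming alpha and beta compose multiplicatively."""
    return alpha(n) * beta(n, k)


def efficiency(n: int, k: int) -> float:
    """Power efficiency as performance per watt."""
    return speedup(n, k) / power(n, k)


def configurations():
    """Yield (threads, threads on big cores), with one thread per core."""
    for n in range(1, N_BIG + N_LITTLE + 1):
        for k in range(0, min(n, N_BIG) + 1):
            if n - k <= N_LITTLE:
                yield n, k


# Exhaustive search contrasting the two objectives (not MAX-P / MAX-E).
best_perf = max(configurations(), key=lambda c: speedup(*c))
best_eff = max(configurations(), key=lambda c: efficiency(*c))
print("max performance:", best_perf, round(speedup(*best_perf), 3))
print("max perf/W:", best_eff, round(efficiency(*best_eff), 4))
```

With these made-up parameters, the two objectives pick very different configurations. Maximizing performance selects all 16 threads, with 4 of them on big cores. Maximizing performance per watt selects just 4 threads, all on little cores. This illustrates why a performance-oriented scheduler and an efficiency-oriented scheduler can diverge.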