DVFS Space Exploration in Power Constrained Processing-in-Memory Systems

In order to deliver high performance under stringent power constraints, future systems may include die-stacked memories with processing-in-memory (PIM) cores. Because of their proximity to the memory, PIMs are expected to target applications which require high bandwidth, implying that PIMs do not need the same computational capabilities as traditional host processor and can therefore be implemented using slower, low-leakage transistors to increase energy efficiency. Such systems must carefully balance design-time choices, such as the circuits used to build the devices, and run-time choices, such as DVFS states and the preferred hardware platform on which to run the application. This paper explores these parameters in a GPGPU PIM system with a large compute-optimized host and a collection of bandwidth-optimized PIMs. We develop high-level performance and power models and use them to find optimal DVFS and kernel placement decisions for a series of GPGPU applications targeting maximum energy efficiency. We find, for instance, that the energy efficiency of PIM systems is greatly affected by DVFS; simply selecting the optimum hardware (host/PIM) results in 7\(\times \) higher ED\(^2\) than migrating work in conjunction with DVFS.

[1]  Nam Sung Kim,et al.  Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[2]  Sherief Reda,et al.  Pack & Cap: Adaptive DVFS and thread packing under power caps , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[3]  Indrani Paul,et al.  Achieving Exascale Capabilities through Heterogeneous Computing , 2015, IEEE Micro.

[4]  Sudhakar Yalamanchili,et al.  Cooperative boosting: needy versus greedy power management , 2013, ISCA.

[5]  Feifei Li,et al.  NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[6]  Dan Bouvier,et al.  Energy efficient graphics and multimedia in 28NM Carrizo APU , 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).

[7]  Yuan Xie,et al.  Die Stacking Is Happening , 2018, IEEE Micro.

[8]  Karthikeyan Sankaralingam,et al.  Your favorite simulator here " Considered Harmful , 2014 .

[9]  Tejas Karkhanis,et al.  Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev..

[10]  Li Shen,et al.  PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[11]  Gabriel H. Loh,et al.  Thermal Feasibility of Die-Stacked Processing in Memory , 2014 .

[12]  Scott A. Mahlke,et al.  Heterogeneous microarchitectures trump voltage scaling for low-power cores , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[13]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[14]  Derek Chiou,et al.  GPGPU performance and power estimation using machine learning , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[15]  Wen-mei W. Hwu,et al.  Heterogeneous System Architecture: A New Compute Platform Infrastructure , 2015 .

[16]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[17]  Sarma B. K. Vrudhula,et al.  Energy-Efficient Operation of Multicore Processors by DVFS, Task Migration, and Active Cooling , 2014, IEEE Transactions on Computers.

[18]  Krishna M. Kavi,et al.  Processing-in-Memory: Exploring the Design Space , 2015, ARCS.

[19]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[20]  Indrani Paul,et al.  A Taxonomy of GPGPU Performance Scaling , 2015, 2015 IEEE International Symposium on Workload Characterization.

[21]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[22]  Collin McCurdy,et al.  The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.

[23]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[24]  Krishna M. Kavi,et al.  Improving Node-Level MapReduce Performance Using Processing-in-Memory Technologies , 2014, Euro-Par Workshops.

[25]  Ronald G. Dreslinski,et al.  Sources of error in full-system simulation , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[26]  Lieven Eeckhout,et al.  DVFS performance prediction for managed multithreaded applications , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[27]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[28]  Jung Ho Ahn,et al.  NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[29]  Elad Alon,et al.  Per-Core DVFS With Switched-Capacitor Converters for Energy Efficiency in Manycore Processors , 2015, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.