Understanding the Future of Energy Efficiency in Multi-Module GPUs

As Moore’s law slows down, GPUs must pivot towards multi-module designs to continue scaling performance at historical rates. Prior work on multi-module GPUs has focused on performance, while largely ignoring the issue of energy efficiency. In this work, we propose a new metric for GPU efficiency called EDP Scaling Efficiency that quantifies the effects of both strong performance scaling and overall energy efficiency in these designs. To enable this analysis, we develop a novel top-down GPU energy estimation framework that is accurate within 10% of a recent GPU design. Being decoupled from granular GPU microarchitectural details, the framework is appropriate for energy efficiency studies in future GPUs. Using this model in conjunction with performance simulation, we show that the dominating factor influencing the energy efficiency of GPUs over the next decade is GPUmodule (GPM) idle time. Furthermore, neither inter-module interconnect energy, nor GPM microarchitectural design is expected to play a key role in this regard. We demonstrate that multi-module GPUs are on a trajectory to become 2⇥ less energy efficient than current monolithic designs; a significant issue for data centers which are already energy constrained. Finally, we show that architects must be willing to spend more (not less) energy to enable higher bandwidth inter-GPM connections, because counter-intuitively, this additional energy expenditure can reduce total GPU energy consumption by as much as 45%, providing a path to energy efficient strong scaling in the future.

[1]  Scott A. Mahlke,et al.  APOGEE: Adaptive prefetching on GPUs for energy efficiency , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[2]  K.M. Wilson,et al.  Dynamic Page Placement to Improve Locality in CC-NUMA Multiprocessors for TPC-C , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[3]  Shuaiwen Song,et al.  A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[4]  Matthew Poremba,et al.  Design and Analysis of an APU for Exascale Computing , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[5]  Natalie D. Enright Jerger,et al.  Enabling interposer-based disintegration of multi-core processors , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Mike O'Connor,et al.  Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[7]  Mahmut T. Kandemir,et al.  μC-States: Fine-grained GPU datapath power management , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[8]  Carole-Jean Wu,et al.  MCM-GPU: Multi-chip-module GPUs for continued performance scalability , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[9]  Karthikeyan Sankaralingam,et al.  Your favorite simulator here " Considered Harmful , 2014 .

[10]  Xi Chen,et al.  A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications , 2013, IEEE Journal of Solid-State Circuits.

[11]  Sudhakar Yalamanchili,et al.  Power Modeling for GPU Architectures Using McPAT , 2014, TODE.

[12]  Indrani Paul,et al.  Dynamic GPGPU Power Management Using Adaptive Model Predictive Control , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[13]  Kevin Skadron,et al.  A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[14]  Joonyoung Kim,et al.  HBM: Memory solution for bandwidth-hungry processors , 2014, 2014 IEEE Hot Chips 26 Symposium (HCS).

[15]  Mattan Erez,et al.  A locality-aware memory hierarchy for energy-efficient GPU architectures , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[16]  Mahmut T. Kandemir,et al.  A case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling flexible data compression with assist warps , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[17]  Amnon Barak,et al.  Memory access patterns: the missing piece of the multi-GPU puzzle , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[18]  Xuhao Chen,et al.  Adaptive Cache Management for Energy-Efficient GPU Computing , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[19]  Daniel A. Jiménez,et al.  Adaptive GPU cache bypassing , 2015, GPGPU@PPoPP.

[20]  Nam Sung Kim,et al.  Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[21]  Mattan Erez,et al.  Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures , 2016, ISCA.

[22]  Mehrzad Samadi,et al.  Memory-centric system interconnect design with hybrid memory cubes , 2013, PACT 2013.

[23]  David M. Brooks,et al.  Energy characterization and instruction-level energy model of Intel's Xeon Phi processor , 2013, International Symposium on Low Power Electronics and Design (ISLPED).

[24]  Onur Mutlu,et al.  A case for toggle-aware compression for GPU systems , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[25]  Long Chen,et al.  Dynamic load balancing on single- and multi-GPU systems , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[26]  Gokcen Kestor,et al.  Quantifying the energy cost of data movement in scientific applications , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[27]  Long Chen,et al.  Exploring Fine-Grained Task-Based Execution on Multi-GPU Systems , 2011, 2011 IEEE International Conference on Cluster Computing.

[28]  John D. Owens,et al.  Multi-GPU MapReduce on GPU Clusters , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[29]  Vivien Quéma,et al.  Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[30]  George Karypis,et al.  Introduction to Parallel Computing , 1994 .

[31]  Scott A. Mahlke,et al.  Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[32]  Sudhakar Yalamanchili,et al.  Coordinated energy management in heterogeneous processors , 2014, Sci. Program..

[33]  William J. Dally,et al.  Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[34]  Carole-Jean Wu,et al.  ID-cache: instruction and memory divergence based cache management for GPUs , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[35]  Wu-chun Feng,et al.  Measuring and modeling on-chip interconnect power on real hardware , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[36]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[37]  Aamer Jaleel,et al.  Beyond the Socket: NUMA-Aware GPUs , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[38]  Nuno Roma,et al.  GPGPU Power Modeling for Multi-domain Voltage-Frequency Scaling , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[39]  Mohammad Abdel-Majeed,et al.  Warped gates: Gating aware scheduling and power gating for GPGPUs , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[40]  G.E. Moore,et al.  Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.

[41]  Henry Hoffmann,et al.  GRAPE: Minimizing energy for GPU applications with performance requirements , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[42]  Carole-Jean Wu,et al.  Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[43]  Hiroshi Sasaki,et al.  Power and Performance Characterization and Modeling of GPU-Accelerated Systems , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[44]  Won Woo Ro,et al.  Warped-Compression: Enabling power efficient GPUs through register compression , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[45]  William J. Dally,et al.  Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[46]  Derek Chiou,et al.  GPGPU performance and power estimation using machine learning , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[47]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[48]  Anoop Gupta,et al.  Scheduling and page migration for multiprocessor compute servers , 1994, ASPLOS VI.

[49]  Carole-Jean Wu,et al.  LATTE-CC: Latency Tolerance Aware Adaptive Cache Compression Management for Energy Efficient GPUs , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[50]  Hideharu Amano,et al.  Breadth First Search on Cost-efficient Multi-GPU Systems , 2016, CARN.

[51]  Onur Mutlu,et al.  The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality In GPUs , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[52]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[53]  Carole-Jean Wu,et al.  Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).

[54]  Thomas R. Gross,et al.  Matching memory access patterns and data placement for NUMA systems , 2012, CGO '12.