Paper information - Measuring and modeling on-chip interconnect power on real hardware

On-chip data movement is a major source of power consumption in modern processors, and future technology nodes will exacerbate this problem. Properly understanding the power that applications expend moving data is vital for inventing mitigation strategies. Previous studies combined data movement energy, which is required to move information across the chip, with data access energy, which is used to read or write on-chip memories. This combination can hide the severity of the problem, because memories and interconnects will scale differently to future technology nodes. Increasing the fidelity of energy measurements is therefore of paramount concern. We propose to use physical data movement distance as a mechanism for separating movement energy from access energy. We then use this mechanism to design microbenchmarks that ascertain data movement energy on a real modern processor. Using these microbenchmarks, we study the following parameters that affect interconnect power: (i) distance, (ii) interconnect bandwidth, (iii) toggle rate, and (iv) voltage and frequency. We conduct our study on an AMD GPU built in 28 nm technology and validate our results against industrial estimates for energy/bit/millimeter. We then construct an empirical model based on our characterization and use it to evaluate the interconnect power of 22 real-world applications. We show that up to 14% of the dynamic power in some applications can be consumed by the interconnect, and we present a range of mitigation strategies.
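The abstract names the parameters studied but does not state the form of the empirical model. As a rough illustration only, the sketch below assumes a first-order relationship: energy per bit grows linearly with distance, power grows linearly with bandwidth and toggle rate, and dynamic energy scales with the square of supply voltage. The function name, its parameters, and the numbers in the usage example are hypothetical placeholders, not the paper's model or measurements.

```python
def interconnect_power_watts(
    energy_pj_per_bit_mm: float,
    distance_mm: float,
    bus_width_bits: int,
    frequency_hz: float,
    toggle_rate: float,
    voltage_v: float = 1.0,
    ref_voltage_v: float = 1.0,
) -> float:
    """First-order interconnect power estimate (illustrative assumption).

    energy_pj_per_bit_mm is the energy to toggle one bit over one
    millimeter at ref_voltage_v. toggle_rate is the fraction of
    transferred bits that actually switch, in the range [0, 1].
    """
    if not 0.0 <= toggle_rate <= 1.0:
        raise ValueError("toggle_rate must be between 0 and 1")
    if ref_voltage_v <= 0.0:
        raise ValueError("ref_voltage_v must be positive")

    # Assumed quadratic voltage dependence of dynamic switching energy.
    voltage_scale = (voltage_v / ref_voltage_v) ** 2

    # Energy to toggle one bit across the full distance, in joules.
    energy_j_per_bit = energy_pj_per_bit_mm * 1e-12 * distance_mm * voltage_scale

    # Peak bandwidth in bits per second; only toggling bits consume energy.
    bandwidth_bits_per_s = bus_width_bits * frequency_hz
    return energy_j_per_bit * bandwidth_bits_per_s * toggle_rate


if __name__ == "__main__":
    # Placeholder inputs chosen for easy arithmetic, not measured values.
    power = interconnect_power_watts(
        energy_pj_per_bit_mm=0.1,
        distance_mm=10.0,
        bus_width_bits=1000,
        frequency_hz=1e9,
        toggle_rate=0.5,
    )
    print(f"Estimated interconnect power: {power:.3f} W")
```

Under these assumptions, halving the toggle rate or the distance halves the estimate, while halving the voltage cuts it to a quarter, which is why the four parameters are worth characterizing separately.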
