Models and Techniques for Green High-Performance Computing

High-performance computing (HPC) systems have become power limited. For instance, in 2008 the U.S. Department of Energy set a power envelope of 20 MW for the first exascale supercomputer, now expected to arrive in 2021–22. Toward this end, we seek to improve the greenness of HPC systems by intelligently managing power so as to improve their performance per watt at the allocated power budget. In this dissertation, we develop a series of models and techniques to manage power at the micro-, meso-, and macro-levels of the system hierarchy, specifically addressing data movement and heterogeneity. We target the chip interconnect at the micro-level, heterogeneous nodes at the meso-level, and a supercomputing cluster at the macro-level.

The first part of this dissertation focuses on measurement and modeling problems for power. First, we study how to infer chip-interconnect power by observing the system-wide power consumption. We propose a novel micro-benchmarking methodology based on data-movement distance, with which we can isolate the chip interconnect and measure its power. Next, we study how to develop software power meters to monitor a GPU’s power consumption at runtime. We propose adapting performance-counter-based models for runtime use via a combination of heuristics, statistical techniques, and application-specific knowledge.

In the second part of this dissertation, we focus on managing power. First, we propose to reduce the chip-interconnect power by proactively managing its dynamic voltage and frequency scaling (DVFS) state. Toward this end, we develop a novel phase predictor that uses approximate pattern matching to forecast future requirements and, in turn, proactively manage power. Second, we study the problem of applying a power cap to a heterogeneous node. Our proposal proactively manages the GPU power using phase prediction and a DVFS power model but reactively manages the CPU.
The resulting hybrid approach can take advantage of the differences in the capabilities of the two devices. Third, we study how in-situ techniques can be applied to improve the greenness of HPC clusters.

Overall, in this dissertation, we demonstrate that it is possible to infer the power consumption of real hardware components without directly measuring them, using the chip interconnect and GPU as examples. We also demonstrate that it is possible to build models of sufficient accuracy and apply them to intelligently manage power at many levels of the system hierarchy.

Vignesh Adhinarayanan

(GENERAL AUDIENCE ABSTRACT)

Past research in green high-performance computing (HPC) mostly focused on managing the power consumed by general-purpose processors, known as central processing units (CPUs), and, to a lesser extent, memory. In this dissertation, we study two increasingly important components: interconnects (predominantly those inside a chip, but not limited to them) and graphics processing units (GPUs). Our contributions include a set of innovative measurement techniques to estimate the power consumed by the target components, statistical and analytical approaches to develop power models and their optimizations, and algorithms to manage power statically and at runtime. Experimental results show that it is possible to build models of sufficient accuracy and apply them to intelligently manage power at multiple levels of the system hierarchy: the chip interconnect at the micro-level, heterogeneous nodes at the meso-level, and a supercomputing cluster at the macro-level.

To my parents, Maheswari and Adhinarayanan, and my sister, Kavipriya.
