New Techniques for Power-Efficient CPU-GPU Processors

Abstract of “New Techniques for Power-Efficient CPU-GPU Processors” by Kapil Dev, Ph.D., Brown University, May 2017

Power is one of the key challenges in improving the performance of modern CPU-GPU processors. Research efforts are needed at both the design time and the run time of a processor to improve its power efficiency (performance per watt). To improve run-time power management, accurate measurement-based power models are needed. Further, the power efficiency of CPU-GPU processors for different workloads depends on the type of device they run on and on the run-time conditions of the system [e.g., thermal design power (TDP) and the presence of other workloads]. An online workload characterization and mapping method is therefore needed. Furthermore, for future massively parallel processors, low-power techniques such as power gating (PG) should be evaluated for their potential benefits before incurring the cost of implementing them. This thesis makes the following contributions towards improving the performance and power efficiency of CPU-GPU processors. First, we propose new techniques for post-silicon power mapping and modeling of multi-core processors using infrared imaging and performance counter measurements. Using detailed thermal and power maps, we demonstrate that, in contrast to traditional multi-core CPUs, heterogeneous processors exhibit more strongly intertwined behavior for dynamic voltage and frequency scaling (DVFS) and workload scheduling in terms of their effect on performance, power, and temperature. Second, we propose a framework to map workloads onto the appropriate device of a CPU-GPU processor under different static and time-varying workload and system conditions. We implement the scheduler on a real CPU-GPU processor and, using OpenCL benchmarks, demonstrate up to 24% runtime improvement and 10% energy savings compared to state-of-the-art scheduling techniques.
Third, to improve the performance and power efficiency of future massively parallel GPUs, we provide an integrated solution to manage leakage power by incorporating workload and run-time awareness into the PG design methodology. On a hypothetical future GPU with 192 compute units, our results show that a PG
