Voltage-Stacked GPUs: A Control Theory Driven Cross-Layer Solution for Practical Voltage Stacking in GPUs

More than 20% of the available energy is lost in "the last centimeter" from the PCB to the microprocessor chip due to inherent inefficiencies of power delivery subsystems (PDSs) in today's computing systems. Voltage stacking (VS) is a novel configuration that series-stacks multiple voltage domains to eliminate explicit voltage conversion and reduce loss along the power delivery path, thereby improving power delivery efficiency (PDE). However, VS suffers from aggravated supply noise caused by current imbalance between the stacking layers, which has prevented its practical adoption in mainstream computing systems. Throughput-centric manycore architectures such as GPUs intrinsically exhibit more balanced workloads, yet suffer from lower PDE, making them ideal platforms for voltage stacking. In this paper, we present a cross-layer approach to practical voltage stacking implementation in GPUs. It combines circuit-level voltage regulation using distributed charge-recycling integrated voltage regulators (CR-IVRs) with architecture-level voltage smoothing guided by control theory. Our proposed voltage-stacked GPUs can eliminate 61.5% of total PDS energy loss and achieve 92.3% system-level power delivery efficiency, a 12.3% improvement over the conventional single-layer PDS. Compared to the circuit-only solution, the cross-layer approach significantly reduces the implementation cost of voltage stacking (88% reduction in area overhead) without compromising supply reliability under worst-case scenarios and across a wide range of real-world benchmarks. In addition, we demonstrate that the cross-layer solution not only complements on-chip CR-IVRs to transparently manage current imbalance and restore stable layer voltages, but also serves as a seamless interface to accommodate higher-level power optimization techniques traditionally thought to be incompatible with a VS configuration.
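The efficiency figures quoted above can be cross-checked with a short calculation. The sketch below is only an illustration and rests on two assumptions that the abstract does not state explicitly: that the 12.3% improvement is measured in percentage points of PDE, and that PDS loss is expressed as a fraction of input energy (1 - PDE). The function name `loss_reduction` is hypothetical.

```python
def loss_reduction(stacked_pde: float, improvement_points: float) -> float:
    """Fraction of PDS loss eliminated, relative to the baseline PDS.

    Assumption (not stated in the abstract): `improvement_points` is an
    absolute difference in PDE, and loss is taken as 1 - PDE.
    """
    baseline_pde = stacked_pde - improvement_points  # conventional single-layer PDS
    baseline_loss = 1.0 - baseline_pde               # loss fraction of the baseline
    stacked_loss = 1.0 - stacked_pde                 # loss fraction with voltage stacking
    return (baseline_loss - stacked_loss) / baseline_loss


# Figures taken from the abstract: 92.3% PDE, 12.3-point improvement.
reduction = loss_reduction(stacked_pde=0.923, improvement_points=0.123)
print(f"baseline PDE: {0.923 - 0.123:.1%}")
print(f"loss eliminated: {reduction:.1%}")
```

Under these assumptions the implied baseline PDE is about 80%, which matches the opening statement that roughly 20% of the energy is lost, and the computed loss reduction agrees with the reported 61.5%.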