Energy-Efficient DNN Computing on GPUs Through Register File Management

Modern deep neural networks (DNNs), which consist largely of matrix multiplications and similar operations, parallelize well and can therefore be accelerated effectively on GPUs. However, energy consumption remains a major concern for DNN workloads and can limit how far performance can be scaled. In this paper, we propose to exploit the specific micro-architecture of GPUs together with DNN application characteristics to improve energy efficiency. Modern GPUs require a huge register file (RF) to hold the contexts of thousands of concurrent threads. Because the GPU RF is built with high-leakage transistors, it contributes significantly to total GPU energy consumption, so smart RF management strategies can help GPUs reduce energy consumption when scaling up hardware resources for higher performance. First, based on the observation that a large fraction of DNN operands are narrow-width, we propose a GPU register packing scheme to use the RF more efficiently. Second, we introduce a drowsy RF with a simple control policy to reduce leakage energy. Finally, we further improve RF energy efficiency by combining the drowsy RF and register packing techniques. We evaluate the effectiveness of our GPU RF management schemes on energy reduction using AlexNet, a state-of-the-art DNN model. Experimental results show that the combination of register packing and drowsy techniques achieves the largest reduction in total GPU energy consumption: up to 11%, and 10.3% on average.
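The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the bit width, the idle-cycle threshold, and all names below are hypothetical, and the drowsy policy shown (enter a data-retentive low-voltage state after a fixed number of idle cycles, pay a one-cycle wake-up on the next access) is only the generic scheme from the drowsy-cache literature.

```python
# Hypothetical sketch of (1) packing two narrow-width operands into one
# 32-bit physical register and (2) a simple per-register drowsy policy.
# NARROW_BITS and DROWSY_THRESHOLD are illustrative parameters, not values
# taken from the paper.

NARROW_BITS = 16          # operands whose significant bits fit in 16 bits
DROWSY_THRESHOLD = 2      # idle cycles before a register turns drowsy


def is_narrow(value, bits=NARROW_BITS):
    """A signed value is 'narrow' if it fits, sign-extended, in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= value <= hi


def pack(a, b, bits=NARROW_BITS):
    """Pack two narrow operands into one 32-bit register word."""
    mask = (1 << bits) - 1
    return ((a & mask) << bits) | (b & mask)


def unpack(word, bits=NARROW_BITS):
    """Recover the two sign-extended operands from a packed register word."""
    mask = (1 << bits) - 1

    def sext(v):
        return v - (1 << bits) if v & (1 << (bits - 1)) else v

    return sext((word >> bits) & mask), sext(word & mask)


class DrowsyRegister:
    """One RF entry that enters a drowsy (data-retentive, low-leakage)
    state after DROWSY_THRESHOLD idle cycles; accessing a drowsy entry
    costs one extra wake-up cycle."""

    def __init__(self):
        self.idle = 0
        self.drowsy = False

    def tick(self):
        # Called once per cycle in which the register is not accessed.
        self.idle += 1
        if self.idle >= DROWSY_THRESHOLD:
            self.drowsy = True

    def access(self):
        # Returns the wake-up penalty (in cycles) for this access.
        penalty = 1 if self.drowsy else 0
        self.idle, self.drowsy = 0, False
        return penalty
```

Packing halves the number of physical registers narrow values occupy (reducing dynamic and leakage energy per live value), while the drowsy policy attacks leakage on registers that stay allocated but idle; the combination targets both effects, which is the pairing the abstract evaluates.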
