DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission

Deep neural networks (DNNs) are key computational building blocks for emerging classes of web services that interact with users in real time via voice, image, and video inputs. Although GPUs have gained popularity as a key accelerator platform for deep learning workloads, the increasing demand for DNN computation leaves a significant gap between the compute capabilities of GPU-enabled datacenters and the compute needed to service demand. State-of-the-art techniques for improving DNN performance have significant limitations in bridging this gap on real systems. Current network pruning techniques remove computation, but the resulting networks map poorly to GPU architectures, yielding no performance benefit or even slowdowns. Meanwhile, current bandwidth optimization techniques focus on reducing off-chip bandwidth while overlooking on-chip bandwidth, a key DNN bottleneck. To address these limitations, this work introduces DeftNN, a GPU DNN execution framework that targets the key architectural bottlenecks of DNNs on GPUs to automatically and transparently improve execution performance. DeftNN is composed of two novel optimization techniques: (1) synapse vector elimination, a technique that identifies non-contributing synapses in the DNN and carefully transforms data to remove the computation and data movement of these synapses while fully utilizing the GPU, and (2) near-compute data fission, a mechanism for scaling down the on-chip data movement requirements within DNN computations. Our evaluation of DeftNN spans 6 state-of-the-art DNNs. By applying both optimizations in concert, DeftNN achieves an average speedup of $2.1\times$ on real GPU hardware. We also introduce a small additional hardware unit per GPU core to facilitate efficient data fission operations, increasing the speedup achieved by DeftNN to $2.6\times$.

CCS Concepts: • Computing methodologies $\rightarrow$ Neural networks; • Computer systems organization $\rightarrow$ Single instruction, multiple data; • General and reference $\rightarrow$ Performance.
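
To make the synapse vector elimination idea concrete, here is a minimal CUDA sketch of its data-transformation step, assuming a row-major weight layout; the kernel name and parameters are illustrative, not DeftNN's actual API. The key point is that the surviving synapse-vector columns are gathered into a smaller *dense* matrix, so a standard dense GEMM (e.g., cuBLAS) can then run on compacted operands at full GPU utilization, rather than on an irregular sparse network that maps poorly to the hardware.

```cuda
// Hedged sketch of the data compaction behind synapse vector elimination.
// kept_idx lists the Kc synapse-vector (column) indices that survive
// elimination; the kernel gathers those columns of the M x K weight matrix
// into a dense M x Kc matrix for a smaller, still-dense GEMM.
#include <cuda_runtime.h>

__global__ void compact_columns(const float* __restrict__ W,        // M x K, row-major
                                float* __restrict__ Wc,              // M x Kc, row-major
                                const int* __restrict__ kept_idx,    // Kc surviving indices
                                int M, int K, int Kc) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    if (row < M && col < Kc) {
        Wc[row * Kc + col] = W[row * K + kept_idx[col]];
    }
}
// Example launch: dim3 block(16, 16);
//                 dim3 grid((Kc + 15) / 16, (M + 15) / 16);
//                 compact_columns<<<grid, block>>>(W, Wc, kept_idx, M, K, Kc);
```

The same gather would be applied to the corresponding rows of the layer's input matrix, so the shrunken GEMM computes the original dot products minus only the eliminated terms.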

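Near-compute data fission can likewise be sketched as a pair of device functions, again under illustrative assumptions: values cross on-chip memory in a packed, reduced-precision form and are "fissioned" back to fp32 in registers immediately before use. The encoding below, which truncates each value to its upper 16 bits (sign, exponent, 7 mantissa bits) and packs two values per 32-bit word, is a plausible scheme chosen for illustration, not necessarily DeftNN's exact format.

```cuda
// Hedged sketch of near-compute data fission: pack before on-chip movement,
// unpack next to the compute units. Packing two truncated values per word
// halves shared-memory traffic at the cost of mantissa precision.
__device__ inline unsigned int fuse2(float a, float b) {
    unsigned int ua = __float_as_uint(a);
    unsigned int ub = __float_as_uint(b);
    // Keep the high 16 bits of each value; a occupies the high half-word.
    return (ua & 0xFFFF0000u) | (ub >> 16);
}

__device__ inline void fission2(unsigned int packed, float* a, float* b) {
    // Restore each value's high bits; the lost mantissa tail reads as zero.
    *a = __uint_as_float(packed & 0xFFFF0000u);
    *b = __uint_as_float(packed << 16);
}
```

In this sketch, a kernel would store `fuse2`-packed words in shared memory and call `fission2` right before the multiply-accumulate, which is the pattern the paper's added per-core hardware unit is meant to accelerate.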