Reduced-Precision Memory Value Approximation for Deep Learning

Zhaoxia Deng, Cong Xu, Qiong Cai, and Paolo Faraboschi
University of California, Santa Barbara

Neural networks (NNs) and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. As NNs have grown more complex over the years, their performance has become increasingly limited by hardware resources. We identify memory capacity and memory bandwidth as the bottlenecks for highly parallel, high-performance deep NN implementations. In this work, we propose reduced-precision memory value approximation to reduce the required memory bandwidth. Our design places an approximator in the memory controller that provides reduced-precision access to the weight parameters. Our results show that reducing a randomly selected 20% of the weights to an 8-bit representation causes only a 1% accuracy loss on ImageNet object classification. Another important observation is that the error resilience of the weight parameters varies from layer to layer, with fully connected layers being more amenable to reduced-precision approximation. Our proposal is expected to achieve significant memory bandwidth savings with less than 1% loss of accuracy.
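To make the idea concrete, below is a minimal NumPy sketch of the software analogue of such reduced-precision reads: a randomly chosen 20% of a layer's weights is round-tripped through an 8-bit uniform fixed-point representation. This is an illustrative assumption of how the evaluation could be emulated, not the paper's memory-controller design; the function name approximate_weights, the quantizer, and the layer shape are hypothetical.

    # Sketch: emulate reduced-precision memory reads for a fraction of the weights.
    import numpy as np

    def approximate_weights(weights, fraction=0.2, bits=8, seed=0):
        """Return a copy of `weights` in which roughly `fraction` of the entries
        have been quantized to a `bits`-bit fixed-point code over the tensor's
        value range and then dequantized back to float."""
        rng = np.random.default_rng(seed)
        w = weights.astype(np.float32).copy()
        mask = rng.random(w.shape) < fraction        # select ~20% of the weights
        lo, hi = float(w.min()), float(w.max())
        levels = 2 ** bits - 1
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((w - lo) / scale)               # 8-bit integer codes
        w_approx = q * scale + lo                    # dequantize back to float32
        w[mask] = w_approx[mask]                     # apply only to the selected subset
        return w

    # Example: apply to one fully connected layer's weight matrix.
    fc = np.random.randn(4096, 1000).astype(np.float32)
    fc_approx = approximate_weights(fc, fraction=0.2, bits=8)
    print("mean absolute perturbation:", np.abs(fc - fc_approx).mean())

Swapping a layer's weights for their approximated copy before inference gives a simple way to measure the accuracy impact of serving that fraction of weight traffic at reduced precision, layer by layer.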
