Re-architecting the on-chip memory sub-system of machine-learning accelerator for embedded devices

The rapid development of deep learning is enabling a host of novel applications, such as image and speech recognition, for embedded systems, robotics, and smart wearable devices. However, typical deep learning models such as deep convolutional neural networks (CNNs) consume so much on-chip storage and high-throughput compute capacity that they cannot be easily handled by mobile or embedded devices with tight silicon and power budgets. To enable large CNN models on mobile and emerging devices for IoT and cyber-physical applications, we propose an efficient on-chip memory architecture for CNN inference acceleration and show its application to our in-house general-purpose deep learning accelerator. The redesigned on-chip memory subsystem, Memsqueezer, includes an active weight buffer set and a data buffer set that employ specialized compression methods to reduce the footprints of CNN weights and data, respectively. The Memsqueezer buffers compress the weight and data sets according to their distinct characteristics, and they also include a built-in redundancy detection mechanism that actively scans the working set of a CNN and eliminates redundant data to boost inference performance. Our experiments show that CNN accelerators with Memsqueezer buffers achieve more than 2× performance improvement and reduce energy consumption by 80% on average over a conventional buffer design with the same area budget.
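The abstract does not detail the compression formats used inside Memsqueezer, but the general idea of shrinking a zero-heavy CNN working set can be illustrated with a minimal sparse index-value encoding sketch. The function names, the thresholding scheme, and the 16-bit index width below are illustrative assumptions for this sketch, not the paper's actual buffer algorithms.

```python
import numpy as np

def compress_sparse_block(block, threshold=0.0):
    """Toy sparse encoding: keep only entries whose magnitude exceeds
    `threshold`, stored as (index, value) pairs. Near-zero weights and
    post-ReLU activations, which dominate CNN working sets, are dropped.
    (Illustrative sketch only; not the Memsqueezer format.)"""
    flat = block.ravel()
    keep = np.nonzero(np.abs(flat) > threshold)[0]
    return keep.astype(np.uint16), flat[keep], flat.size

def decompress_sparse_block(indices, values, size):
    """Rebuild the dense block from the (index, value) stream."""
    flat = np.zeros(size, dtype=values.dtype)
    flat[indices] = values
    return flat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic post-ReLU activation tile: mostly zeros, a few non-zeros.
    tile = np.maximum(rng.standard_normal((8, 8)).astype(np.float32) - 1.0, 0.0)
    idx, vals, size = compress_sparse_block(tile)
    restored = decompress_sparse_block(idx, vals, size).reshape(tile.shape)
    assert np.allclose(tile, restored)
    dense_bytes = tile.nbytes
    packed_bytes = idx.nbytes + vals.nbytes
    print(f"dense: {dense_bytes} B, packed: {packed_bytes} B "
          f"({packed_bytes / dense_bytes:.0%} of original)")
```

In a hardware buffer, the same principle would be applied per tile at the buffer boundary so that only the packed stream occupies on-chip SRAM; the software sketch above only demonstrates the size reduction and lossless round trip.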
