Efficient Hardware Architectures for Deep Convolutional Neural Network

Convolutional neural networks (CNNs) are the state-of-the-art deep learning approach in a wide range of applications, and real-time CNN implementations on resource-limited embedded systems are increasingly in demand. Field-programmable gate arrays (FPGAs) are well suited to implementing CNN models because they offer programmable flexibility and short development cycles. However, limited off-chip bandwidth and on-chip memory are the main bottlenecks of CNN acceleration. In this paper, we propose efficient hardware architectures to accelerate deep CNN models. We first present the theoretical derivation of the parallel fast finite impulse response (FIR) algorithm (FFA). Based on FFAs, we develop the corresponding fast convolution units (FCUs) that compute the convolutions in CNN models. We also propose novel data storage and reuse schemes that keep all intermediate pixels on-chip and reduce the bandwidth requirement. We choose one of the largest and most accurate networks, VGG16, and implement it on Xilinx Zynq ZC706 and Virtex VC707 boards, respectively. Using an equal-distance non-uniform quantization method, we achieve a top-5 accuracy of 86.25%. The average performance is estimated to be 316.23 GOP/s at a 172-MHz working frequency on the ZC706 and 1250.21 GOP/s at 170 MHz on the VC707. In brief, the proposed design significantly outperforms existing works, surpassing related designs by more than a factor of two in resource efficiency.
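
As background for the FFA-based FCUs, the sketch below shows the classic two-parallel fast FIR decomposition (in the spirit of Parhi's fast parallel FIR structures); the paper's FCUs may well use higher-order forms, e.g., three-parallel FFAs matched to 3x3 kernels, so this is illustrative only.

```latex
% Two-parallel fast FIR algorithm (FFA).
% Decompose the filter and the input into even/odd polyphase components:
%   H(z) = H_0(z^2) + z^{-1} H_1(z^2),   X(z) = X_0(z^2) + z^{-1} X_1(z^2).
% The two parallel output streams are then
\begin{align}
Y_0 &= H_0 X_0 + z^{-1} H_1 X_1, \\
Y_1 &= (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1,
\end{align}
% which produces two output samples per cycle using three length-(N/2)
% subfilters instead of four, i.e., a 25% reduction in multiplications
% at the cost of a few extra additions.
```

Reusing the shared products H_0 X_0 and H_1 X_1 across both outputs is the kind of multiplication-saving structure that FFA-based convolution units exploit.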
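
The abstract does not detail the equal-distance non-uniform quantization scheme. As a hedged illustration only, the sketch below uses equally spaced levels within each layer while letting the step size vary per layer, so the overall codebook across layers is non-uniform; the function name and defaults are hypothetical, not the paper's method.

```python
import numpy as np

# Hedged sketch: levels are equally spaced *within* a layer, but the step
# size is chosen per layer from that layer's dynamic range, so spacings
# differ (are non-uniform) across layers. Hypothetical helper, not the
# paper's exact scheme.
def quantize_layer(weights: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Fake-quantize one layer's weights to 2**n_bits equally spaced levels."""
    w_max = float(np.abs(weights).max())
    if w_max == 0.0:                         # degenerate layer: nothing to do
        return weights
    step = 2.0 * w_max / (2**n_bits - 1)     # equal distance between levels
    idx = np.clip(np.round(weights / step),  # nearest-level index, clipped
                  -(2**(n_bits - 1)),        # to the signed n-bit range
                  2**(n_bits - 1) - 1)
    return (idx * step).astype(weights.dtype)

# Two layers with different dynamic ranges end up with different
# (non-uniform) level spacings, even though each layer's grid is uniform.
conv1 = np.random.randn(64, 3, 3, 3).astype(np.float32) * 0.5
conv2 = np.random.randn(128, 64, 3, 3).astype(np.float32) * 0.05
q1, q2 = quantize_layer(conv1), quantize_layer(conv2)
```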
