A high utilization FPGA-based accelerator for variable-scale convolutional neural network

Convolutional Neural Networks (CNNs) play an essential role in computer vision applications owing to their high classification accuracy and robust generalization capability. In recent years, various GPU-based and application-specific hardware approaches have been proposed to accelerate CNN computation. However, for variable-scale CNNs, on-chip DSP utilization remains low because of the image-boundary problem. In this paper, we propose an optimization framework that solves the boundary problem, and we connect our accelerator to ARM processors and DDR4 memory through a dual Advanced eXtensible Interface (AXI) bus. Each port sustains a peak throughput of 1.6 GB/s in full duplex. The accelerator performs 160 G-op/s at peak and achieves 96% computing-resource utilization.
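To make the boundary problem concrete, the sketch below (an illustrative model, not the paper's actual architecture; the array width and kernel sizes are assumptions) estimates how image boundaries leave part of a fixed-width DSP array idle for a valid, stride-1 convolution:

```python
# Illustrative sketch (hypothetical parameters, not the paper's design):
# estimate the fraction of DSP slots doing useful work when a fixed-width
# array processes one output row of a valid-padding, stride-1 convolution.

def dsp_utilization(image, kernel, array):
    """Utilization of an `array`-wide DSP row over an `image`-wide input
    convolved with a `kernel`-wide filter (valid padding, stride 1)."""
    outputs = image - kernel + 1      # output pixels per row after the boundary shrink
    passes = -(-outputs // array)     # ceil division: array passes needed per row
    return outputs / (passes * array) # useful work over total slots occupied

# Smaller images waste proportionally more of the array at the boundary:
print(dsp_utilization(image=224, kernel=3, array=64))  # ~0.867
print(dsp_utilization(image=32,  kernel=3, array=64))  # ~0.469
```

Variable-scale inputs make this loss unpredictable, which is why a framework that restructures work around the boundary can push utilization toward the 96% the paper reports.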
