A Resource-Limited Hardware Accelerator for Convolutional Neural Networks in Embedded Vision Applications

In this brief, we introduce an architecture for accelerating the convolution stages of convolutional neural networks (CNNs) in embedded vision systems. The architecture exploits the inherent parallelism of CNNs to reduce the bandwidth, resource usage, and power consumption of the highly computationally complex convolution operations required by real-time embedded applications. We implement the proposed architecture using fixed-point arithmetic on a ZC706 evaluation board featuring a Xilinx Zynq-7000 system-on-chip, where the embedded ARM processor, with its high clock speed, serves as the main controller to increase flexibility and speed. The proposed architecture runs at 150 MHz, which yields 19.2 giga multiply-accumulate operations per second (GMAC/s) while consuming less than 10 W of power. This is achieved using only 391 DSP48 modules, a significant improvement in resource utilization over state-of-the-art architectures.

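The throughput figure follows directly from the clock rate and the degree of parallelism: 19.2 GMAC/s at 150 MHz corresponds to 19.2e9 / 150e6 = 128 multiply-accumulate operations completed per clock cycle. The C sketch below is a minimal software reference model of the fixed-point multiply-accumulate kernel that such an accelerator parallelizes across its DSP48 modules; the 16-bit Q8.8 data format and the 3x3 window size are illustrative assumptions and are not details taken from the brief.

#include <stdint.h>

/* Minimal software reference model of the fixed-point multiply-accumulate
 * (MAC) kernel that a convolution accelerator parallelizes in hardware.
 * The Q8.8 format and 3x3 window are illustrative assumptions only. */

#define KSIZE 3   /* assumed kernel dimension */

static int16_t conv_mac(const int16_t window[KSIZE * KSIZE],
                        const int16_t kernel[KSIZE * KSIZE])
{
    int32_t acc = 0;                            /* wide accumulator, as a DSP48 slice provides */
    for (int i = 0; i < KSIZE * KSIZE; i++)
        acc += (int32_t)window[i] * kernel[i];  /* one fixed-point MAC per tap */
    return (int16_t)(acc >> 8);                 /* rescale the Q16.16 sum back to Q8.8 */
}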