Recently, FPGAs have been widely used to implement hardware accelerators for Convolutional Neural Networks (CNNs), especially on mobile and embedded devices. However, most existing accelerators follow the same design concept as their ASIC counterparts: operations from different CNN layers are mapped to the same hardware units and executed in a time-multiplexed fashion. Although this approach improves the generality of these accelerators, it does not take full advantage of the reconfigurability and customizability of FPGAs, which degrades computational efficiency; the degradation is even more pronounced on embedded platforms. In this paper, we propose an FPGA-based CNN accelerator in which every layer is mapped to its own dedicated on-chip unit and all layers work concurrently as a pipeline. We propose a strategy that finds an optimized parallelization scheme for each layer, eliminating pipeline stalls and achieving high resource utilization. In addition, a balanced pruning-based method is applied to the fully connected (FC) layers to reduce their computational redundancy. As a case study, we implement a widely used CNN model, LeNet-5, on an embedded FPGA device, the Xilinx Zedboard. Our design achieves a peak performance of 39.78 GOP/s and a power efficiency of 19.6 GOP/s/W, outperforming previous approaches.
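To make the pipeline-balancing idea concrete, below is a minimal sketch (not the authors' actual algorithm) of one way to choose per-layer parallelism: allocate processing elements (PEs) in proportion to each layer's operation count, so every pipeline stage finishes in roughly the same number of cycles and no stage stalls its neighbors. The LeNet-5 MAC counts and the 220-DSP budget (the Zynq-7020 on the Zedboard) are illustrative assumptions.

```python
import math

def balance_parallelism(layer_ops, pe_budget):
    """Split a budget of processing elements (PEs) across pipeline
    stages so per-stage cycle counts (ops / PEs) are roughly equal;
    the slowest stage sets the throughput of the whole pipeline."""
    total_ops = sum(layer_ops)
    # First pass: proportional allocation, at least one PE per layer.
    alloc = [max(1, math.floor(pe_budget * ops / total_ops))
             for ops in layer_ops]
    # Hand leftover PEs to whichever stage is currently the bottleneck.
    while sum(alloc) < pe_budget:
        bottleneck = max(range(len(layer_ops)),
                         key=lambda i: layer_ops[i] / alloc[i])
        alloc[bottleneck] += 1
    return alloc

# Approximate per-layer MAC counts for LeNet-5 (conv1, conv2, fc1,
# fc2, fc3) and a 220-DSP budget -- both illustrative assumptions.
lenet_macs = [117_600, 240_000, 48_000, 10_080, 840]
print(balance_parallelism(lenet_macs, 220))  # e.g. [62, 126, 25, 6, 1]
```

With this allocation, each stage's cycle count (ops divided by its PEs) lands near 1,900 cycles, so the stages drain at about the same rate.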
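Similarly, here is a hedged sketch of what a balanced pruning scheme for an FC layer can look like: instead of pruning weights globally by magnitude, keep the same number of nonzeros in every fixed-size group of each row, so the multiplier lanes consuming the sparse matrix stay load-balanced. The grouping, group size, and keep ratio below are hypothetical parameters, not values from the paper.

```python
import numpy as np

def balanced_prune(weights, keep_ratio, group_size):
    """Prune an FC weight matrix so every group of `group_size`
    consecutive weights in a row keeps exactly the same number of
    nonzeros, keeping per-lane workloads identical (an assumed
    realization of 'balanced pruning')."""
    rows, cols = weights.shape
    assert cols % group_size == 0, "row must split into whole groups"
    keep = max(1, int(round(group_size * keep_ratio)))
    pruned = np.zeros_like(weights)
    for r in range(rows):
        for g in range(0, cols, group_size):
            block = weights[r, g:g + group_size]
            # Keep the `keep` largest-magnitude weights in this block.
            idx = np.argsort(np.abs(block))[-keep:]
            pruned[r, g + idx] = block[idx]
    return pruned

# Example: prune a LeNet-5-sized 120x84 FC layer to ~25% density.
w = np.random.randn(120, 84)
w_pruned = balanced_prune(w, keep_ratio=0.25, group_size=12)
print((w_pruned != 0).mean())  # exactly 3/12 = 0.25 per group
```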