A Software-Hardware Collaboration System for CNN Algorithms Based on FPGA

In this paper, an SoC system combining an ARM processor and a convolution accelerator is designed for CNN algorithms on the ZC706 evaluation board. Through tiling and loop reorganization, the system achieves a high data reuse rate, which greatly reduces the data bandwidth required between the on-chip buffer and DDR memory. The convolution accelerator supports kernel sizes from $1 \times 1$ to $11 \times 11$, and the supported activation functions are ReLU and Leaky ReLU. The processor of the SoC is mainly responsible for control and for the remaining computations of the CNN, such as LRN and pooling, which makes the system more versatile and flexible. At a working frequency of 100 MHz, the peak performance reaches 45.16 GFLOPS, which is 142.8x faster than the Cortex-A9, and the energy efficiency is 219.5x better than that of the i7-4790K.
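To illustrate how tiling can raise data reuse, the following is a minimal software sketch, not the accelerator's actual dataflow. It assumes stride 1, no padding, a square output tile, and a loop order in which each input tile (with its halo) is copied into a buffer once and then reused for every output channel. The names `conv_tiled`, `tile`, `alpha`, and `words_loaded` are hypothetical and are introduced only for this example; `words_loaded` is a simple stand-in for traffic from DDR to the on-chip buffer.

```python
def leaky_relu(x, alpha=0.0):
    # alpha = 0.0 gives plain ReLU; alpha > 0 gives Leaky ReLU.
    return x if x > 0 else alpha * x


def conv_direct(inp, weights, k):
    """Untiled reference convolution (stride 1, no padding, no activation).

    inp[c][y][x] is the input feature map, weights[m][c][ky][kx] the kernels.
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M = len(weights)
    OH, OW = H - k + 1, W - k + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):
        for oy in range(OH):
            for ox in range(OW):
                acc = 0.0
                for c in range(C):
                    for ky in range(k):
                        for kx in range(k):
                            acc += inp[c][oy + ky][ox + kx] * weights[m][c][ky][kx]
                out[m][oy][ox] = acc
    return out


def conv_tiled(inp, weights, k, tile=4, alpha=0.0):
    """Tiled convolution with activation; returns (output, words_loaded).

    Assumed scheme: each (tile + k - 1)^2 input patch is loaded once per tile
    and reused across all M output channels before the next tile is fetched.
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M = len(weights)
    OH, OW = H - k + 1, W - k + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    words_loaded = 0
    for ty in range(0, OH, tile):
        for tx in range(0, OW, tile):
            th, tw = min(tile, OH - ty), min(tile, OW - tx)
            # Model the on-chip buffer: copy the input tile plus halo once.
            buf = [[row[tx:tx + tw + k - 1] for row in inp[c][ty:ty + th + k - 1]]
                   for c in range(C)]
            words_loaded += C * (th + k - 1) * (tw + k - 1)
            for m in range(M):  # reuse the buffered tile for every output channel
                for oy in range(th):
                    for ox in range(tw):
                        acc = 0.0
                        for c in range(C):
                            for ky in range(k):
                                for kx in range(k):
                                    acc += buf[c][oy + ky][ox + kx] * weights[m][c][ky][kx]
                        out[m][ty + oy][tx + ox] = leaky_relu(acc, alpha)
    return out, words_loaded
```

Under these assumptions, a 2-channel $6 \times 6$ input with three $3 \times 3$ kernels and `tile=2` loads 128 input words in total, whereas fetching every operand of every multiply-accumulate separately would read $3 \cdot 16 \cdot 2 \cdot 9 = 864$ input words. The numerical result matches the untiled reference followed by the same activation.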