Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA

3-D convolutional neural networks (3-D CNNs) are used effectively in many computer vision applications. Most previous work in this area has concentrated on the design and optimization of accelerators for 2-D CNNs, and few attempts have been made to accelerate 3-D CNNs on FPGA. Accelerating 3-D CNNs on FPGA is challenging due to their high computational complexity and storage demands. More importantly, although the computational patterns of 2-D and 3-D CNNs are analogous, the conventional approaches adopted for accelerating 2-D CNNs may be unfit for 3-D CNN acceleration. In this paper, in order to accelerate 2-D and 3-D CNNs within a uniform framework, we first propose a uniform template-based architecture that uses templates based on the Winograd algorithm to enable the rapid development of 2-D and 3-D CNN accelerators. Then, with the aim of efficiently mapping all layers of 2-D/3-D CNNs onto a pipelined accelerator, we develop techniques to improve the throughput and computational efficiency of the accelerator, including layer fusion, layer clustering, and a workload-balancing scheme. Finally, we demonstrate the effectiveness of the deep pipelined architecture by accelerating real-life 2-D and 3-D CNNs on a state-of-the-art FPGA platform. On the VCU118, we achieve 3.7 TOPS for VGG-16, which outperforms state-of-the-art FPGA-based CNN accelerators. Comparisons with CPU and GPU solutions demonstrate that our 3-D CNN implementation achieves gains of up to $17.8\times$ in performance and $64.2\times$ in energy efficiency relative to a CPU solution, and a $5.0\times$ energy efficiency gain over a GPU solution.
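To make the role of the Winograd templates concrete, the following is a minimal NumPy sketch of the 1-D Winograd transform F(2, 3) (after Lavin and Gray, "Fast Algorithms for Convolutional Neural Networks"), which underlies such template-based convolution engines: two outputs of a 3-tap convolution are computed with 4 element-wise multiplications instead of the 6 a direct computation needs. The matrices and function names here are illustrative, not the paper's actual implementation.

```python
import numpy as np

# Winograd F(2, 3) transform matrices (Lavin & Gray). Output is
# y = A^T [(G g) * (B^T d)], where * is element-wise multiplication.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
    U = G @ g    # filter transform (precomputable once per layer)
    V = BT @ d   # input-tile transform
    M = U * V    # 4 element-wise multiplications (vs. 6 for direct conv)
    return AT @ M  # inverse transform back to 2 outputs

# Sanity check against a direct sliding-window convolution.
d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 0.25, -1.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

The 2-D (and, by a further nesting step, 3-D) variants nest this transform along each spatial dimension, which is why a single template parameterized over tile dimensionality can serve both 2-D and 3-D CNN layers.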
