LCP: a Layer Clusters Paralleling mapping method for accelerating Inception and Residual networks on FPGA

Deep convolutional neural networks (DCNNs) have been widely used in various AI applications. Inception and Residual are two promising structures adopted in many important modern DCNN models, including AlphaGo Zero's model. These structures allow considerably increasing the depth and width of the network to improve accuracy, without increasing the computational budget or the difficulty of convergence. Various accelerators for DCNNs have been proposed based on FPGA platform because it has advantages of high performance, good power efficiency, and fast development round, etc. However, previous FPGA mapping methods cannot fully adapt to the different data localities among layers and other characteristics of Inception and Residual, which leads to a under-utilization of FPGA resources. We propose LCP, a Layer Clusters Paralleling mapping method to classify the layers into clusters based on their differences of parameters and data localities, and then accelerate them in different partitions of FPGA. We evaluate our mapping method by implementing Inception/Residual modules from GoogLeNet [8] and ResNet-50 [4] on Xilinx VC709 (Virtex 690T) FPGA. The results show that the proposed method fully utilizes resources and achieves up to 4.03 × performance than the baseline and 2.00 × performance than the state-of-the-art methods.

[1]  Geoffrey H. Ball,et al.  ISODATA, A NOVEL METHOD OF DATA ANALYSIS AND PATTERN CLASSIFICATION , 1965 .

[2]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[4]  Jason Cong,et al.  Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[5]  Kiyoung Choi,et al.  Efficient FPGA acceleration of Convolutional Neural Networks using logical-3D compute array , 2016, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[6]  Michael Ferdman,et al.  Maximizing CNN accelerator efficiency through resource partitioning , 2016, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[7]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[8]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Demis Hassabis,et al.  Mastering the game of Go without human knowledge , 2017, Nature.

[10]  Manoj Alwani,et al.  Fused-layer CNN accelerators , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).