Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

In recent years, Convolutional Neural Networks (CNNs) have become widely adopted for computer vision tasks. FPGAs have been adequately explored as a promising hardware accelerator for CNNs due to its high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolutional algorithm is often bounded by the computational capability of FPGAs (e.g., the number of DSPs). In this paper, we demonstrate that fast Winograd algorithm can dramatically reduce the arithmetic complexity, and improve the performance of CNNs on FPGAs. We first propose a novel architecture for implementing Winograd algorithm on FPGAs. Our design employs line buffer structure to effectively reuse the feature map data among different tiles. We also effectively pipeline the Winograd PE engine and initiate multiple PEs through parallelization. Meanwhile, there exists a complex design space to explore. We propose an analytical model to predict the resource usage and reason about the performance. Then, we use the model to guide a fast design space exploration. Experiments using the state-of-the-art CNNs demonstrate the best performance and energy efficiency on FPGAs. We achieve an average 1006.4 GOP/s for the convolutional layers and 854.6 GOP/s for the overall AlexNet and an average 3044.7 GOP/s for the convolutional layers and 2940.7 GOP/s for the overall VGG16 on Xilinx ZCU102 platform.

[1]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[3]  Andrew Lavin,et al.  Fast Algorithms for Convolutional Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Andrew C. Ling,et al.  An OpenCL™ Deep Learning Accelerator on Arria 10 , 2017, FPGA.

[5]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Srihari Cadambi,et al.  A Massively Parallel Coprocessor for Convolutional Neural Networks , 2009, 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors.

[8]  Xiaowei Li,et al.  C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[9]  Jason Cong,et al.  Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[10]  Jason Cong,et al.  Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[11]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[12]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[13]  Dong Han,et al.  Cambricon: An Instruction Set Architecture for Neural Networks , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[14]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Yun Liang,et al.  High-Level Synthesis: Productivity, Performance, and Software Constraints , 2012, J. Electr. Comput. Eng..

[16]  Yu Cao,et al.  Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks , 2016, FPGA.

[17]  Yann LeCun,et al.  CNP: An FPGA-based processor for Convolutional Networks , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[18]  Jason Helge Anderson,et al.  LegUp: high-level synthesis for FPGA-based processor/accelerator systems , 2011, FPGA '11.

[19]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Jian Sun,et al.  Efficient and accurate approximations of nonlinear convolutional networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Srihari Cadambi,et al.  A dynamically configurable coprocessor for convolutional neural networks , 2010, ISCA.

[22]  Yu Wang,et al.  Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[23]  S. Winograd Arithmetic complexity of computations , 1980 .

[24]  Jason Cong,et al.  High-Level Synthesis for FPGAs: From Prototyping to Deployment , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[25]  Viktor Prasanna,et al.  Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System , 2017, FPGA.