Stride 2 1-D, 2-D, and 3-D Winograd for Convolutional Neural Networks

Convolutional neural networks (CNNs) have been widely adopted for computer vision applications. CNNs require many multiplications, making them expensive in terms of both computational complexity and hardware. An effective way to reduce the number of required multiplications is the Winograd algorithm. Previous Winograd-based CNN implementations use the 2-D algorithm <inline-formula> <tex-math notation="LaTeX">$F(2 \times 2,3 \times 3)$ </tex-math></inline-formula>, which reduces computational complexity by a factor of 2.25 over regular convolution. However, current Winograd implementations apply only when the stride (the displacement of the kernel over the input) is 1. In this article, we present a novel method for applying the Winograd algorithm with a stride of 2. The method is valid for one, two, or three dimensions. We also introduce new Winograd versions compatible with kernels of size 3, 5, and 7. The algorithms were successfully implemented on an NVIDIA K20c GPU. Compared to regular convolution, the stride-2 implementations are 1.44 times faster for a <inline-formula> <tex-math notation="LaTeX">$3 \times 3$ </tex-math></inline-formula> kernel, <inline-formula> <tex-math notation="LaTeX">$2.04\times $ </tex-math></inline-formula> faster for a <inline-formula> <tex-math notation="LaTeX">$5\times 5$ </tex-math></inline-formula> kernel, <inline-formula> <tex-math notation="LaTeX">$2.42\times $ </tex-math></inline-formula> faster for a <inline-formula> <tex-math notation="LaTeX">$7 \times 7$ </tex-math></inline-formula> kernel, and <inline-formula> <tex-math notation="LaTeX">$1.73\times $ </tex-math></inline-formula> faster for a <inline-formula> <tex-math notation="LaTeX">$3 \times 3 \times 3$ </tex-math></inline-formula> kernel. 
Additionally, a CNN accelerator using a novel processing element (PE) that performs either two 2-D Winograd stride-1 operations or one 2-D Winograd stride-2 operation per clock cycle was implemented on an Intel Arria 10 field-programmable gate array (FPGA). We accelerated the original and our proposed modified VGG-16 architectures, achieving digital signal processor (DSP) efficiencies of 1.22 giga operations per second per DSP (GOPS/DSP) and 1.33 GOPS/DSP, respectively.
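As background for the multiplication savings quoted above, the following sketch shows the standard 1-D stride-1 Winograd transform F(2,3) (the minimal-filtering algorithm the paper generalizes to stride 2 and higher dimensions). The transform matrices are the well-known ones from the minimal-filtering literature, not the stride-2 matrices derived in this article; the function name is illustrative. It computes two outputs of a 3-tap correlation with 4 multiplications instead of the 6 required by direct computation, the same per-tile saving that yields the 2.25x factor in 2-D.

```python
import numpy as np

# Standard F(2,3) transform matrices (input, kernel, and output transforms).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two stride-1 outputs of a 3-tap correlation using 4 elementwise
    multiplies (the Hadamard product) instead of 6."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 input samples (one tile)
g = np.array([1.0, 2.0, 3.0])        # 3-tap kernel
y = winograd_f23(d, g)

# Direct correlation for comparison: 6 multiplies for the same 2 outputs.
ref = np.array([d[0:3] @ g, d[1:4] @ g])
```

Here `y` matches `ref` exactly up to floating-point rounding; the stride-2 algorithms in the article replace these transform matrices so that adjacent output tiles advance the input window by 2 samples instead of 1.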

[1]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Quoc V. Le,et al.  Searching for MobileNetV3 , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Carlo Gatta,et al.  Unsupervised Deep Feature Extraction for Remote Sensing Image Classification , 2015, IEEE Transactions on Geoscience and Remote Sensing.

[5]  Andrew C. Ling,et al.  An OpenCL™ Deep Learning Accelerator on Arria 10 , 2017, FPGA.

[6]  Yu Hu,et al.  State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[8]  Peng Zhang,et al.  Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[9]  Yu Wang,et al.  Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.

[10]  Chin-Hui Lee,et al.  An End-to-End Deep Learning Approach to Simultaneous Speech Dereverberation and Acoustic Modeling for Robust Speech Recognition , 2017, IEEE Journal of Selected Topics in Signal Processing.

[11]  Zelong Wang,et al.  Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA , 2018, FPGA.

[12]  Jason Cong,et al.  Minimizing Computation in Convolutional Neural Networks , 2014, ICANN.

[13]  Si-Wei Chen,et al.  PolSAR Image Classification Using Polarimetric-Feature-Driven Deep Convolutional Neural Network , 2018, IEEE Geoscience and Remote Sensing Letters.

[14]  Trung-Nghia Le,et al.  Video Salient Object Detection Using Spatiotemporal Deep Features , 2017, IEEE Transactions on Image Processing.

[15]  Shengen Yan,et al.  Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs , 2017, 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[16]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[17]  Khan M. Iftekharuddin,et al.  Sparse Simultaneous Recurrent Deep Learning for Robust Facial Expression Recognition , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[18]  Bjoern M. Eskofier,et al.  Mobile Stride Length Estimation With Deep Convolutional Neural Networks , 2018, IEEE Journal of Biomedical and Health Informatics.

[19]  Fang Liu,et al.  Wishart Deep Stacking Network for Fast POLSAR Image Classification , 2016, IEEE Transactions on Image Processing.

[20]  S. Winograd Arithmetic complexity of computations , 1980 .

[21]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[22]  Thomas Brox,et al.  Striving for Simplicity: The All Convolutional Net , 2014, ICLR.

[23]  Andrew Lavin,et al.  Fast Algorithms for Convolutional Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Manoj Alwani,et al.  Fused-layer CNN accelerators , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  Haiyan Guan,et al.  Rotation-Invariant Object Detection in High-Resolution Satellite Imagery Using Superpixel-Based Deep Hough Forests , 2015, IEEE Geoscience and Remote Sensing Letters.

[26]  Hoi-Jun Yoo,et al.  An Energy-Efficient and Scalable Deep Learning/Inference Processor With Tetra-Parallel MIMD Architecture for Big Data Applications , 2015, IEEE Transactions on Biomedical Circuits and Systems.

[27]  Feng Wu,et al.  Background Prior-Based Salient Object Detection via Deep Reconstruction Residual , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[28]  Y Huang,et al.  A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm , 2018 .

[29]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[30]  Yu Wang,et al.  Instruction driven cross-layer CNN accelerator with winograd transformation on FPGA , 2017, 2017 International Conference on Field Programmable Technology (ICFPT).

[31]  Forrest N. Iandola,et al.  SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.

[32]  Shengen Yan,et al.  Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[33]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.