Towards Optimal Winograd Convolution on Manycores

As convolutional layers are computationally intensive and dominate the total execution time of modern deep convolutional neural networks (ConvNets) [15, 22, 24, 25], many efforts have been made to improve the performance of convolutional primitives for CPUs [1, 6, 26, 31, 33], GPUs [4, 7, 19, 27], or both [32]. An important class of improvements reduces the number of computations required for a convolution. Several efforts employed FFT-based convolutions to this end on GPUs [19, 27] and CPUs [31, 32]. Recently, Lavin et al. [16] proposed an algorithm based on the Winograd algorithm for minimal filtering, originally developed for the fast computation of finite impulse response (FIR) filters [30].

The key idea of Winograd-based convolution is similar to that of the FFT-based approach. The inputs and the kernels are first transformed; an element-wise multiplication, which is equivalent to a matrix multiplication, is then performed; and an inverse transformation of the result yields the result of the convolution. Unlike FFT-based convolution, where the point-wise multiplications are performed in the complex domain, Winograd-based convolution operates on real numbers and thus requires fewer operations. After Lavin et al. [16] demonstrated that Winograd-based convolution can be more effective than FFTs at reducing the number of multiplications, especially for small 2D kernels (e.g. 3 × 3), Nervana [4] and Nvidia's cuDNN [7] implemented Winograd-based convolution for GPUs. CPU implementations were also provided by FALCON [1], LIBXSMM [5], Intel MKL-DNN [2], and Budden et al. [6].

However, currently available implementations support only 2D convolutions and a single kernel size (3 × 3), which restricts the range of applications of Winograd-based convolution. 3D ConvNets are becoming increasingly important, as they have been successfully applied to many fields [9, 14, 20, 21]. Our work on the design, implementation, and evaluation of a fast Winograd-based convolution is motivated by the fact that current Winograd-based implementations for multicore and manycore CPUs perform well below the hardware capability; in many cases, they are outperformed by more computationally expensive but better optimized implementations, such as direct convolution.
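To make the transform, element-wise multiply, and inverse-transform structure concrete, the following NumPy sketch computes a single F(2×2, 3×3) output tile using the transform matrices B^T, G, and A^T given by Lavin et al. [16]. It is only an illustrative sketch: the per-tile formulation, function name, and the sanity check against direct correlation are our own additions for exposition and do not reflect the optimized implementation discussed in this work.

```python
import numpy as np

# F(2x2, 3x3) Winograd transform matrices (Lavin et al. [16]).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(tile, kernel):
    """Compute one 2x2 output tile from a 4x4 input tile and a 3x3 kernel."""
    U = G @ kernel @ G.T    # kernel transform (4x4)
    V = BT @ tile @ BT.T    # input transform  (4x4)
    M = U * V               # element-wise multiplication: 16 multiplies
    return AT @ M @ AT.T    # inverse transform (2x2)

# Sanity check against direct (valid) cross-correlation on one tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_2x2_3x3(d, g), direct)
```

The element-wise product uses 16 multiplications per tile, compared with 36 for computing the same 2 × 2 output directly, which is the source of the arithmetic savings.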

References

[1] S. Winograd, Arithmetic Complexity of Computations, 1980.
[2] Vijay Madisetti, The Digital Signal Processing Handbook, Second Edition - 3 Volume Set, 2009.
[3] Vincent Vanhoucke, et al., Improving the speed of neural networks on CPUs, 2011.
[4] Geoffrey E. Hinton, et al., ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[5] John Tran, et al., cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.
[6] Yann LeCun, et al., Fast Training of Convolutional Networks through FFTs, 2013, ICLR.
[7] Razvan Pascanu, et al., On the Number of Linear Regions of Deep Neural Networks, 2014, NIPS.
[8] Jeff Johnson, et al., Fast Convolutional Nets With fbfft: A GPU Performance Evaluation, 2014, ICLR.
[9] Sebastian Scherer, et al., VoxNet: A 3D Convolutional Neural Network for real-time object recognition, 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[10] Jimeng Sun, et al., An input-adaptive and in-place approach to dense tensor-times-matrix multiply, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] G. Henry, et al., LIBXSMM: A High Performance Library for Small Matrix Multiplications, 2015.
[12] Kai Li, et al., Full correlation matrix analysis of fMRI data on Intel® Xeon Phi™ coprocessors, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] Sebastian Scherer, et al., 3D Convolutional Neural Networks for landing zone detection from LiDAR, 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).
[14] Dumitru Erhan, et al., Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Andrew Zisserman, et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[16] Avinash Sodani, et al., Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, 2nd Edition, 2016.
[17] H. Sebastian Seung, et al., ZNNi: Maximizing the Inference Throughput of 3D Convolutional Networks on CPUs and GPUs, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] H. Sebastian Seung, et al., ZNN -- A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines, 2015, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[19] Thomas Brox, et al., 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation, 2016, MICCAI.
[20] Andrew Lavin, et al., Fast Algorithms for Convolutional Neural Networks, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Alexander Heinecke, et al., LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[22] H. Sebastian Seung, et al., Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs, 2017, ICS '17.
[23] Daniel Thalmann, et al., 3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation from Single Depth Images, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Gianni De Fabritiis, et al., DeepSite: protein-binding site predictor using 3D-convolutional neural networks, 2017, Bioinformatics.
[25] Nir Shavit, et al., Deep Tensor Convolution on Multicores, 2016, ICML.
[26] Kevin Vincent, et al., On Improving the Numerical Stability of Winograd Convolutions, 2017, ICLR.
[27] Frédo Durand, et al., Optimizing N-dimensional, winograd-based convolution for manycore CPUs, 2018, PPoPP.