A Deeper Look at FFT and Winograd Convolutions

Since convolutional layers are computationally expensive and dominate the total execution time of modern deep ConvNets [13, 16, 18, 19], many efforts have been made to improve the performance of convolutional primitives for CPUs [1, 7, 20, 25, 27], GPUs [4, 8, 15, 21], or both [26]. Initially, several approaches using FFT-based convolutions were proposed [15, 21, 25, 26]. Recent work by Lavin et al. on Winograd-based convolutions [14] then demonstrated significant speedups, which shifted the focus from FFT-based to Winograd-based implementations, as it became widely accepted that the Winograd-based approach provides a greater reduction in the number of operations, especially for small kernels (e.g. 3 × 3). A well-optimized manycore CPU implementation [3, 12] of the Winograd approach can improve performance by more than 3×.

The main reduction in operations of the Winograd method, compared to FFT, comes from the fact that it works with real numbers. However, due to its numerical instability, the Winograd method can only use small tile (transform) sizes [7, 14, 22], which in turn require more data movement to and from memory. In contrast, the FFT-based method does not suffer from such instability, so larger tile sizes can be used; this somewhat reduces the number of required operations and greatly reduces the amount of data movement, and these savings can, in certain cases, offset the additional operations incurred by complex arithmetic. These observations raise the question of under what conditions the Winograd-based approach performs better than the FFT-based approach (and vice versa), and of how the two approaches should be compared.
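
To make the operation-count argument concrete, below is a minimal NumPy sketch (an illustration, not code from any of the cited implementations) of the 1-D Winograd minimal filtering algorithm F(2,3) described by Lavin et al. [14]: it produces two outputs of a 3-tap correlation with 4 multiplications instead of 6, via Y = A^T((G g) * (B^T d)), and then prints the multiplication-reduction factor 3m/(m+2) of F(m,3) for a few tile sizes m. The matrices BT, G, and AT below are the standard F(2,3) transforms; all other names are chosen only for this example.

```python
import numpy as np

# Transform matrices of the 1-D Winograd minimal filtering algorithm F(2,3):
# two outputs of a 3-tap correlation from a 4-element input tile, using
# 4 element-wise multiplications instead of the 6 needed by the direct method.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Y = A^T((G g) * (B^T d)) for a 4-element input tile d and 3-tap kernel g."""
    return AT @ ((G @ g) * (BT @ d))

rng = np.random.default_rng(0)
d = rng.standard_normal(4)                      # input tile
g = rng.standard_normal(3)                      # kernel
direct = np.array([np.dot(d[0:3], g),           # valid correlation, output 0
                   np.dot(d[1:4], g)])          # valid correlation, output 1
print(np.allclose(winograd_f23(d, g), direct))  # True

# Multiplication reduction of F(m,3): (m + 2) multiplications per tile of m
# outputs instead of 3*m for the direct method, i.e. a factor of 3m / (m + 2).
# Larger tiles save more work, but the transform matrices become increasingly
# ill-conditioned, which is why Winograd implementations keep tiles small.
for m in (2, 4, 6):
    print(f"F({m},3): {3 * m / (m + 2):.2f}x fewer multiplications")
```

The 2-D algorithm F(m × m, 3 × 3) nests the same transforms along both dimensions, Y = A^T((G g G^T) * (B^T d B)) A, and FFT-based convolution follows the same transform, element-wise multiply, inverse-transform structure with the DFT in place of the Winograd matrices; this shared structure is what makes the two approaches directly comparable in terms of operations and data movement.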

[1] Sally A. McKee, et al. Hitting the memory wall: implications of the obvious, 1995, CARN.

[2] Steven G. Johnson, et al. FFTW: an adaptive software architecture for the FFT, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98).

[3] Vincent Vanhoucke, et al. Improving the speed of neural networks on CPUs, 2011.

[4] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[5] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.

[6] R. Fergus, et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, 2013, ICLR.

[7] Yann LeCun, et al. Fast Training of Convolutional Networks through FFTs, 2013, ICLR.

[8] Razvan Pascanu, et al. On the Number of Linear Regions of Deep Neural Networks, 2014, NIPS.

[9] Jeff Johnson, et al. Fast Convolutional Nets With fbfft: A GPU Performance Evaluation, 2014, ICLR.

[10] Dumitru Erhan, et al. Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[12] Avinash Sodani, et al. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, 2nd Edition, 2016.

[13] H. Sebastian Seung, et al. ZNNi: Maximizing the Inference Throughput of 3D Convolutional Networks on CPUs and GPUs, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[14] H. Sebastian Seung, et al. ZNN -- A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[15] Andrew Lavin, et al. Fast Algorithms for Convolutional Neural Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Alexander Heinecke, et al. LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[17] H. Sebastian Seung, et al. Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs, 2017, ICS '17.

[18] Nir Shavit, et al. Deep Tensor Convolution on Multicores, 2016, ICML.

[19] Kevin Vincent, et al. On Improving the Numerical Stability of Winograd Convolutions, 2017, ICLR.

[20] Frédo Durand, et al. Optimizing N-dimensional, Winograd-based convolution for manycore CPUs, 2018, PPoPP.