FFT Convolutions are Faster than Winograd on Modern CPUs, Here is Why

Winograd-based convolution has quickly gained traction as a preferred approach for implementing convolutional neural networks (ConvNets) on various hardware platforms because it requires fewer floating point operations than FFT-based or direct convolutions. This paper compares three highly optimized implementations (regular FFT-, Gauss-FFT-, and Winograd-based convolutions) on modern multi- and many-core CPUs. Although all three implementations employ the same optimizations for modern CPUs, our experimental results with two popular ConvNets (VGG and AlexNet) show that the FFT-based implementations generally outperform the Winograd-based approach, contrary to popular belief. To understand these results, we use the Roofline performance model to analyze the three implementations in detail, examining each of their computation phases and considering not only the number of floating point operations but also the memory bandwidth and cache sizes. The analysis explains why, and under what conditions, the FFT-based implementations outperform the Winograd-based one on modern CPUs.

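For context, the Roofline model referred to above bounds attainable throughput by both compute and memory. In its standard form (our notation, with $\beta$ denoting peak memory bandwidth and $I$ the arithmetic intensity of a computation phase, not necessarily the paper's symbols):

$$P_{\text{attainable}} \;=\; \min\bigl(P_{\text{peak}},\; \beta \cdot I\bigr), \qquad I \;=\; \frac{\text{floating point operations}}{\text{bytes moved to and from memory}}.$$

A phase whose arithmetic intensity falls left of the ridge point $I = P_{\text{peak}}/\beta$ is memory-bound, so reducing its floating point operation count (as Winograd-based convolution does) need not reduce its runtime.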