Low-memory GEMM-based convolution algorithms for deep neural networks

Deep neural networks (DNNs) require very large amounts of computation, both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as general matrix multiplication (GEMM). However, as we demonstrate in this paper, there are a great many different ways to express DNN convolution operations using GEMM. Although the different approaches all perform the same number of operations, the size of their temporary data structures differs significantly. Convolution of a $C \times H \times W$ input with a $K \times K$ kernel requires $O(K^2CHW)$ additional space using the classical im2col approach. More recently, memory-efficient approaches requiring just $O(KCHW)$ auxiliary space have been proposed. We present two novel GEMM-based algorithms that require just $O(MHW)$ and $O(KW)$ additional space, respectively, where $M$ is the number of channels in the result of the convolution. These algorithms dramatically reduce the space overhead of DNN convolution, making it much more suitable for memory-limited embedded systems. Experimental evaluation shows that our low-memory algorithms are just as fast as the best patch-building approaches despite requiring just a fraction of the additional memory. Our low-memory algorithms have excellent data locality, which gives them a further edge over patch-building algorithms when multiple cores are used. As a result, our low-memory algorithms often outperform the best patch-building algorithms using multiple threads.
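
To make the $O(K^2CHW)$ figure concrete, the sketch below shows a plain-C version of the classical im2col expansion for a stride-1, zero-padded convolution; it is an illustrative assumption rather than code from the paper, and the function name, layer dimensions, and padding choice are ours. The temporary patch matrix has $CK^2$ rows and $HW$ columns, which is where the quadratic dependence on the kernel size $K$ comes from; the low-memory algorithms in the paper avoid materialising this matrix.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch (not the paper's code): classical im2col for a
 * C x H x W input convolved with K x K kernels, stride 1, zero ("same")
 * padding.  The patch matrix alone holds C*K*K * H*W floats, i.e. the
 * O(K^2 CHW) temporary storage discussed in the abstract. */
static void im2col(const float *in, int C, int H, int W, int K,
                   float *patches)
{
    int pad = K / 2;                              /* same padding */
    for (int c = 0; c < C; c++)
        for (int ky = 0; ky < K; ky++)
            for (int kx = 0; kx < K; kx++) {
                int row = (c * K + ky) * K + kx;  /* patch-matrix row */
                for (int y = 0; y < H; y++)
                    for (int x = 0; x < W; x++) {
                        int iy = y + ky - pad;
                        int ix = x + kx - pad;
                        float v = 0.0f;           /* zero padding */
                        if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                            v = in[(c * H + iy) * W + ix];
                        patches[(size_t)row * H * W + y * W + x] = v;
                    }
            }
}

int main(void)
{
    int C = 64, H = 56, W = 56, K = 3;            /* hypothetical layer */
    size_t patch_elems = (size_t)C * K * K * H * W;

    float *in = calloc((size_t)C * H * W, sizeof *in);
    float *patches = malloc(patch_elems * sizeof *patches);
    if (!in || !patches) return 1;

    im2col(in, C, H, W, K, patches);

    /* The convolution itself would now be one GEMM:
     * output (M x H*W) = kernels (M x C*K*K) * patches (C*K*K x H*W). */
    printf("im2col buffer: %zu floats (%.1f MiB)\n",
           patch_elems, patch_elems * sizeof(float) / (1024.0 * 1024.0));

    free(in);
    free(patches);
    return 0;
}
```

For this small example layer the patch buffer is roughly nine times the size of the input itself, which illustrates why im2col is a poor fit for memory-limited embedded systems.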
