Efficient cuDNN-Compatible Convolution-Pooling on the GPU

The main contribution of this paper is to show efficient implementations of the convolution-pooling on the GPU, in which the pooling follows the multiple convolution. Since the multiple convolution and the pooling operations are performed alternately in the earlier stages of many Convolutional Neural Networks (CNNs), it is very important to accelerate the convolution-pooling. Our new GPU implementation uses two techniques: (1) convolution interchange with direct sum, and (2) conversion to matrix multiplication. These techniques reduce both the computational cost and the memory access cost. Further, the interchanged convolution is converted to a matrix multiplication, which can be computed very efficiently by cuBLAS. Experimental results using a Tesla V100 GPU show that our new cuDNN-compatible GPU implementation of the convolution-pooling is at least 1.34 times faster than performing the multiple convolution followed by the pooling with cuDNN, the most popular library of primitives for implementing CNNs on the GPU.
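The abstract does not spell out the interchange, so the following is only a minimal sketch of the underlying idea under stated assumptions: a single input channel, a single filter, and average pooling over non-overlapping p-by-p windows. Because convolution and average pooling are both linear, pooling the convolution output gives the same result as first taking p-by-p window sums of the input (the "direct sum") and then convolving those sums with stride p. All function names here are hypothetical, and the code is plain Python for checking the identity, not the paper's GPU implementation.

```python
import random


def conv2d_valid(x, w, stride=1):
    """Valid 2D cross-correlation of x with a square kernel w, sampled every `stride`."""
    k = len(w)
    out_h = (len(x) - k) // stride + 1
    out_w = (len(x[0]) - k) // stride + 1
    return [
        [
            sum(w[u][v] * x[i * stride + u][j * stride + v]
                for u in range(k) for v in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def avg_pool(y, p):
    """Average pooling over non-overlapping p-by-p windows (assumed pooling type)."""
    out_h, out_w = len(y) // p, len(y[0]) // p
    return [
        [
            sum(y[i * p + a][j * p + b] for a in range(p) for b in range(p)) / (p * p)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv_then_pool(x, w, p):
    """Baseline order: full convolution first, then average pooling."""
    return avg_pool(conv2d_valid(x, w), p)


def pool_interchanged(x, w, p):
    """Interchanged order: p-by-p window sums first, then a stride-p convolution."""
    ones = [[1.0] * p for _ in range(p)]
    s = conv2d_valid(x, ones)            # direct sum over every p-by-p window
    z = conv2d_valid(s, w, stride=p)     # only the pooled output positions are computed
    return [[v / (p * p) for v in row] for row in z]


if __name__ == "__main__":
    random.seed(0)
    x = [[random.random() for _ in range(10)] for _ in range(10)]
    w = [[random.random() for _ in range(3)] for _ in range(3)]
    a = conv_then_pool(x, w, 2)
    b = pool_interchanged(x, w, 2)
    max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    print("max abs difference:", max_diff)
```

In this sketch the interchanged order evaluates the k-by-k weighted sum only at the pooled output positions, so the number of multiplications drops by roughly a factor of p squared compared with the baseline, at the price of the extra additions in the window sums. The strided convolution that remains is the part that could then be expressed as a matrix multiplication.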
