Abstract The main contribution of this paper is to show efficient GPU implementations of the convolution-pooling operation, in which pooling follows multiple convolutions. Since multiple convolution and pooling operations are performed alternately in the earlier stages of many Convolutional Neural Networks (CNNs), accelerating the convolution-pooling is very important. Our new GPU implementation uses two techniques: (1) convolution interchange with direct sum, and (2) conversion to matrix multiplication. These techniques reduce both the computational cost and the memory access cost. Furthermore, the interchanged convolution is converted to a matrix multiplication, which cuBLAS can compute very efficiently. Experimental results on a Tesla V100 GPU show that our new implementation, which is compatible with cuDNN, the most popular library of primitives for implementing CNNs on the GPU, is 2.90 times faster for fp32 and 1.43 times faster for fp16 than performing the multiple convolution followed by the pooling with cuDNN.
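The abstract does not spell out the convolution-interchange technique, but the underlying identity can be sketched: average pooling applied after a convolution equals a single strided convolution whose kernel is the direct sum of the shifted copies of the original kernel, scaled by the pooling window size. A minimal NumPy sketch of this equivalence (function names and the single-channel setting are illustrative, not the paper's implementation):

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 'valid' cross-correlation (the usual CNN convention)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def avg_pool(x, p):
    """Non-overlapping p x p average pooling."""
    h, w = x.shape[0] // p, x.shape[1] // p
    return x[:h * p, :w * p].reshape(h, p, w, p).mean(axis=(1, 3))

def fused_kernel(w, p):
    """Direct sum: add the p*p shifted copies of w, scale by 1/p^2."""
    kh, kw = w.shape
    W = np.zeros((kh + p - 1, kw + p - 1))
    for dy in range(p):
        for dx in range(p):
            W[dy:dy + kh, dx:dx + kw] += w
    return W / (p * p)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 10))
w = rng.standard_normal((3, 3))
p = 2

reference = avg_pool(conv2d_valid(x, w), p)            # convolve, then pool
fused = conv2d_valid(x, fused_kernel(w, p))[::p, ::p]  # one strided convolution
assert np.allclose(reference, fused)
```

The fused form evaluates one (slightly larger) convolution at the pooled output positions only, which is what makes it amenable to a single im2col-style conversion to matrix multiplication handled by cuBLAS.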