Fast Convolution Operations on Many-Core Architectures

Convolution operations are widely used in many important application domains, such as deep learning and computer vision, where convolution is typically the most time-consuming step. Their high computational throughput and memory bandwidth make many-core architectures promising targets for accelerating these applications. In this paper, we implement and optimize several convolution operations, including 1D convolution, 2D convolution, and multi-channel 2D convolution executed in mini-batch mode, on both GPU and Intel MIC many-core architectures. We find that the performance bottleneck of 1D and 2D convolution lies in the registers rather than in local memory or the L1/L2 cache, and we therefore use register tiling to improve performance. In addition, we present a novel solution for multi-channel 2D convolution in which the convolution is performed directly on the images instead of being translated into matrix multiplication, so that the data reuse inherent in the algorithm is fully exploited. We further summarize the autotuning parameters for multi-channel 2D convolution and prune the search space using heuristics. Experimental results show that, for large filter sizes, our solution achieves up to a 33% performance improvement over cuDNN-v2 on a GTX TITAN and up to 28% over a clBLAS-based implementation on an AMD W8000. On Intel MIC, our solution reaches up to 25% of the theoretical peak performance.
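
As a rough illustration of the register tiling idea mentioned above, the following sketch shows the loop structure for 1D convolution in plain Python. It is not the paper's kernel: the function names, the `tile` parameter, and the correlation-style (unflipped) filter convention are assumptions made for this example, and the list `acc` only stands in for the registers a real GPU or MIC kernel would use.

```python
def conv1d_naive(signal, taps):
    """Reference 'valid' 1D convolution (correlation form, filter not flipped)."""
    n_out = len(signal) - len(taps) + 1
    return [sum(signal[i + k] * taps[k] for k in range(len(taps)))
            for i in range(n_out)]


def conv1d_register_tiled(signal, taps, tile=4):
    """Each work item produces `tile` adjacent outputs (hypothetical sketch).

    The accumulators in `acc` play the role of registers: every input element
    is loaded once per tile and reused by all outputs in the tile that need it.
    """
    n_taps = len(taps)
    n_out = len(signal) - n_taps + 1
    out = [0.0] * n_out
    for base in range(0, n_out, tile):
        width = min(tile, n_out - base)      # last tile may be shorter
        acc = [0.0] * width
        # A tile of `width` outputs touches width + n_taps - 1 inputs.
        for j in range(width + n_taps - 1):
            x = signal[base + j]             # single load of this input element
            # x contributes to output t of the tile with tap index j - t.
            for t in range(max(0, j - n_taps + 1), min(width, j + 1)):
                acc[t] += x * taps[j - t]
        out[base:base + width] = acc
    return out
```

In this arrangement a tile of `tile` outputs reads `tile + len(taps) - 1` inputs instead of `tile * len(taps)`, which is the kind of reuse that register tiling is meant to capture; the best tile size on real hardware would depend on register pressure and is left open here.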
