Fast multidimensional reduction and broadcast operations on GPU for machine learning

Reduction and broadcast operations appear frequently in machine learning algorithms. In particular, they are widely used when computing the gradients of a loss function, a computation at the core of training neural networks. Many libraries implement both operations naively, usually for the scalar reduction or broadcast case; to our knowledge, however, no optimized multidimensional implementations are available. This limits the performance of machine learning models that must perform these operations on tensors. In this work, we address the problem and propose two new strategies that extend the existing implementations to operate on tensors. We give formal definitions of both operations in tensor notation, investigate their mathematical properties, and exploit these properties to derive an efficient solution for each. We implement our parallel strategies and test them on a CUDA-enabled Tesla K40m GPU accelerator. Our implementations achieve up to 75% of the peak device memory bandwidth across a range of tensor sizes and dimensions. They also achieve significant speedups over the implementations available in the Knet deep learning framework for both operations.
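To make the two operations concrete, the sketch below shows naive, single-thread-per-element CUDA kernels that reduce a row-major (n0, n1, n2) tensor over its middle dimension and then broadcast the reduced result back onto the full tensor. This is a minimal baseline under assumed conventions, not the optimized strategies proposed in the paper; the kernel names, memory layout, and the choice of summation/addition as the operators are all illustrative.

```cuda
// Minimal sketch (assumed layout and names, not the paper's optimized kernels):
// dimension-wise reduction and broadcast on a row-major (n0, n1, n2) tensor.
#include <cstdio>
#include <cuda_runtime.h>

// Sum over the middle dimension: out[i0, i2] = sum_{i1} in[i0, i1, i2].
__global__ void reduce_dim1(const float* in, float* out,
                            int n0, int n1, int n2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n0 * n2) return;               // one thread per output element
    int i0 = idx / n2, i2 = idx % n2;
    float acc = 0.0f;
    const float* p = in + (size_t)i0 * n1 * n2 + i2;
    for (int i1 = 0; i1 < n1; ++i1)           // stride n2 between i1 steps
        acc += p[(size_t)i1 * n2];
    out[idx] = acc;
}

// Broadcast add along the middle dimension:
// out[i0, i1, i2] = in[i0, i1, i2] + b[i0, i2].
__global__ void broadcast_add_dim1(const float* in, const float* b,
                                   float* out, int n0, int n1, int n2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n0 * n1 * n2) return;
    int i2 = idx % n2;                        // fastest-varying index
    int i0 = idx / (n1 * n2);                 // slowest-varying index
    out[idx] = in[idx] + b[i0 * n2 + i2];
}

int main()
{
    const int n0 = 4, n1 = 8, n2 = 16;
    const int in_elems = n0 * n1 * n2, out_elems = n0 * n2;

    float *in, *red, *bc;                     // unified memory for brevity
    cudaMallocManaged(&in, in_elems * sizeof(float));
    cudaMallocManaged(&red, out_elems * sizeof(float));
    cudaMallocManaged(&bc, in_elems * sizeof(float));
    for (int i = 0; i < in_elems; ++i) in[i] = 1.0f;

    reduce_dim1<<<(out_elems + 255) / 256, 256>>>(in, red, n0, n1, n2);
    cudaDeviceSynchronize();
    broadcast_add_dim1<<<(in_elems + 255) / 256, 256>>>(in, red, bc,
                                                        n0, n1, n2);
    cudaDeviceSynchronize();

    // Each reduced entry sums n1 ones; each broadcast entry is 1 + n1.
    printf("red[0] = %.0f, bc[0] = %.0f\n", red[0], bc[0]);

    cudaFree(in); cudaFree(red); cudaFree(bc);
    return 0;
}
```

Kernels like these perform little arithmetic per byte moved and are therefore memory-bound, which is why the abstract reports performance as a fraction of peak device memory bandwidth.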
