cuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs

Tensors are the cornerstone data structures in high-performance computing, big data analysis, and machine learning. However, tensor computations are compute-intensive, and the running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for meeting ever-growing data processing demands efficiently. Existing GPU basic linear algebra subroutines (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives, so researchers have to implement and optimize their own tensor algorithms case by case, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency-domain computation scheme to expose the separability of tensor operations in that domain, then maps the tube-wise and slice-wise parallelism onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize data transfer and memory accesses, and design batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively.
In the evaluations of t-product, t-SVD, t-QR, t-inverse, and t-normalization, cuTensor-tubal achieves maximum $16.91\times$, $27.03\times$, $38.97\times$, $22.36\times$, and $15.43\times$ speedups, respectively, over the CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum $9.80\times$ and $269.26\times$ speedups over multi-core CPU implementations.
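
To make the frequency-domain scheme concrete, the following is a minimal CPU reference sketch of the t-product in plain Python. It is not the cuTensor-tubal implementation or its API. The function names (`dft`, `t_fft`, `t_product`) and the nested-list layout `tensor[i][j][k]` are assumptions chosen for illustration, and a naive O(n^2) DFT stands in for a real FFT. The sketch shows the three steps the abstract describes: transform every tube along the third mode, multiply the frontal slices independently, and transform back.

```python
import cmath


def dft(tube, inverse=False):
    """Naive O(n^2) discrete Fourier transform of one tube (list of numbers)."""
    n = len(tube)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for j, x in enumerate(tube):
            acc += x * cmath.exp(sign * 2j * cmath.pi * j * k / n)
        # The inverse transform carries the 1/n normalization.
        out.append(acc / n if inverse else acc)
    return out


def t_fft(tensor, inverse=False):
    """Apply the DFT to every tube tensor[i][j][:]; tubes are independent."""
    return [[dft(tube, inverse) for tube in row] for row in tensor]


def t_product(a, b):
    """t-product of a (n1 x n2 x n3) with b (n2 x n4 x n3), giving n1 x n4 x n3."""
    n1, n2, n3 = len(a), len(b), len(a[0][0])
    n4 = len(b[0])
    fa, fb = t_fft(a), t_fft(b)
    fc = [[[0j] * n3 for _ in range(n4)] for _ in range(n1)]
    # In the frequency domain each frontal slice k is an ordinary matrix
    # product that does not depend on any other slice.
    for k in range(n3):
        for i in range(n1):
            for j in range(n4):
                fc[i][j][k] = sum(fa[i][p][k] * fb[p][j][k] for p in range(n2))
    return t_fft(fc, inverse=True)
```

For two 1 x 1 x 3 tensors with tubes (1, 2, 3) and (4, 5, 6), `t_product` returns a tube that equals (31, 31, 28) up to floating-point rounding, which is the circular convolution of the two tubes. The independent iterations of the loop over `k` correspond to the slice-wise parallelism, and the independent `dft` calls to the tube-wise parallelism, that a GPU implementation can exploit.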
