HPTT: a high-performance tensor transposition C++ library

We recently presented TTC, a domain-specific compiler for tensor transpositions. Although the code it generates performs nearly optimally, TTC works offline and therefore cannot be used in application codes in which the tensor sizes and the required permutations are determined only at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Like TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore, it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures: only the hand-vectorized micro-kernel (e.g., a 4 x 4 transpose) needs to be replaced. HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY and yields remarkable speedups over Eigen's tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.
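To make the idea of "loops around a micro-kernel" concrete, the following is a minimal sketch for the two-dimensional case only. It is not HPTT's code or API: the names micro_kernel_4x4 and transpose_blocked are hypothetical, the micro-kernel is written in plain scalar C++ rather than hand-vectorized, and multi-threading and autotuning are left out. It assumes row-major storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile edge handled by the (hypothetical) micro-kernel.
constexpr std::size_t kTile = 4;

// Hypothetical micro-kernel: transposes one 4 x 4 tile.
// This scalar body stands in for the architecture-specific part; a port
// to another architecture would replace only this function with SIMD code.
void micro_kernel_4x4(const float* in, std::size_t ld_in,
                      float* out, std::size_t ld_out) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j)
            out[j * ld_out + i] = in[i * ld_in + j];
}

// Transposes a row-major rows x cols matrix `in` into the
// row-major cols x rows matrix `out`.
void transpose_blocked(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols && out.size() == rows * cols);

    // Largest region that is covered by whole 4 x 4 tiles.
    const std::size_t full_rows = rows - rows % kTile;
    const std::size_t full_cols = cols - cols % kTile;

    // Blocking: the loops around the micro-kernel walk over tiles.
    for (std::size_t i = 0; i < full_rows; i += kTile)
        for (std::size_t j = 0; j < full_cols; j += kTile)
            micro_kernel_4x4(in.data() + i * cols + j, cols,
                             out.data() + j * rows + i, rows);

    // Remainder: elements outside the tiled region are copied one by one.
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = (i < full_rows ? full_cols : 0); j < cols; ++j)
            out[j * rows + i] = in[i * cols + j];
}
```

In this sketch, the blocking logic and the remainder handling never change with the target architecture; only the body of micro_kernel_4x4 would. A general tensor transposition adds more loops (one per remaining index) around the same kind of kernel.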
