HPTT: a high-performance tensor transposition C++ library
[1] Ian T. Foster, et al. Design and Performance of a Scalable Parallel Community Climate Model, 1995, Parallel Comput.
[2] Jeff Johnson, et al. Fast Convolutional Nets With fbfft: A GPU Performance Evaluation, 2014, ICLR.
[3] Gregory H. Bauer, et al. Optimizing matrix transposes using a POWER7 cache model and explicit prefetching, 2012, PERV.
[4] P. Cochat, et al. Et al, 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.
[5] Marin van Heel. A fast algorithm for transposing large multidimensional image data sets, 1991.
[6] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.
[7] Robert J. Harrison, et al. MADNESS: A Multiresolution, Adaptive Numerical Environment for Scientific Simulation, 2015, SIAM J. Sci. Comput.
[8] Paolo Bientinesi, et al. TTC: a tensor transposition compiler for multiple architectures, 2016, ARRAY@PLDI.
[9] John McCalpin, et al. Automatic benchmark generation for cache optimization of matrix operations, 1995, ACM-SE 33.
[10] Dmitry I. Lyakh. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU, 2015, Comput. Phys. Commun.
[11] Steven G. Johnson, et al. The Design and Implementation of FFTW3, 2005, Proceedings of the IEEE.
[12] Geoffrey C. Goldbogen, et al. PRIM: A Fast Matrix Transpose Method, 1981, IEEE Transactions on Software Engineering.
[13] Paolo Bientinesi, et al. TTC: A high-performance Compiler for Tensor Transpositions, 2017, ACM Trans. Math. Softw.
[14] Sriram Krishnamoorthy, et al. Combining analytical and empirical approaches in tuning matrix transposition, 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[15] Dmitry I. Lyakh, et al. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs, 2017, ArXiv.
[16] R. Bartlett, et al. Coupled-cluster theory in quantum chemistry, 2007.
[17] Andrey Vladimirov. Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors, 2013.
[18] Siddhartha Chatterjee, et al. Cache-efficient matrix transposition, 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).
[19] Lai Wei, et al. Autotuning Tensor Transposition, 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.
[20] Dmitry Pekurovsky, et al. P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions, 2012, SIAM J. Sci. Comput.
[21] M. Head-Gordon, et al. A fifth-order perturbation comparison of electron correlation theories, 1989.
[22] Paolo Bientinesi, et al. Design of a High-Performance GEMM-like Tensor–Tensor Multiplication, 2016, ACM Trans. Math. Softw.
[23] Robert A. van de Geijn, et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality, 2015, ACM Trans. Math. Softw.
[24] Ibai Gurrutxaga, et al. Efficient 3D Transpositions in Graphics Processing Units, 2015, International Journal of Parallel Programming.
[25] James Demmel, et al. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.