Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512
[1] Yuefan Deng, et al. New trends in high performance computing, 2001, Parallel Computing.
[2] Tze Meng Low, et al. The BLIS Framework, 2016.
[3] Gang Ren, et al. Is Search Really Necessary to Generate High-Performance BLAS?, 2005, Proceedings of the IEEE.
[4] Jaeyoung Choi, et al. OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing, 2018, HPC Asia Workshops.
[5] Qian Wang, et al. AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Jaeyoung Choi, et al. Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors, 2018, The Journal of Supercomputing.
[7] Jim Jeffers, et al. Knights Landing overview, 2016.
[8] Jack J. Dongarra, et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.
[9] Pradeep Dubey, et al. Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[10] Robert A. van de Geijn, et al. Anatomy of High-Performance Many-Threaded Matrix Multiplication, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[11] Avinash Sodani, et al. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition 2nd Edition, 2016.
[12] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.
[13] Jack J. Dongarra, et al. Automated empirical optimizations of software and the ATLAS project, 2001, Parallel Comput.
[14] James Demmel, et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, 1997, ICS '97.
[15] Jaeyoung Choi, et al. An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512, 2018, Cluster Computing.
[16] Tze Meng Low, et al. Analytical Modeling Is Enough for High-Performance BLIS, 2016, ACM Trans. Math. Softw.