Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors
Jaeyoung Choi | Yoosang Park | Raehyun Kim | Thi My Tuyen Nguyen
[1] Yuefan Deng, et al. New trends in high performance computing, 2001, Parallel Computing.
[2] B. R. Nanjesh, et al. Performance evaluation and comparison of MPI and PVM using a cluster based parallel computing architecture, 2013, 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT).
[3] Laxmikant V. Kale, et al. Heterogeneous computing with OpenMP and Hydra, 2020, Concurr. Comput. Pract. Exp.
[4] Rolf Hempel, et al. The MPI Standard for Message Passing, 1994, HPCN.
[5] Salvatore Filippone. Parallel Libraries on Distributed Memory Architectures: The IBM Parallel ESSL, 1996, PARA.
[6] Robert A. van de Geijn, et al. A Family of High-Performance Matrix Multiplication Algorithms, 2004, PARA.
[7] J. Choi, et al. A fast scalable universal matrix multiplication algorithm on distributed-memory concurrent computers, 1997, Proceedings 11th International Parallel Processing Symposium.
[8] Jaeyoung Choi, et al. Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512, 2019, HPC Asia.
[9] Osvaldo Gervasi, et al. On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond, 2021, ACM Trans. Archit. Code Optim.
[10] Robert A. van de Geijn, et al. SUMMA: Scalable Universal Matrix Multiplication Algorithm, 1995.
[11] Robert A. van de Geijn, et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality, 2015, ACM Trans. Math. Softw.
[12] Alejandro Duran, et al. The Design of OpenMP Tasks, 2009, IEEE Transactions on Parallel and Distributed Systems.
[13] Dhabaleswar K. Panda, et al. High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters, 2016, 2016 45th International Conference on Parallel Processing (ICPP).
[14] Al Geist, et al. A survey of high-performance computing scaling challenges, 2017, Int. J. High Perform. Comput. Appl.
[15] Jack Dongarra, et al. PVM: A Users' Guide and Tutorial for Network Parallel Computing, 1994.
[16] Sarita V. Adve, et al. HPVM: heterogeneous parallel virtual machine, 2018, PPoPP.
[17] Robert A. van de Geijn, et al. A Family of High-Performance Matrix Multiplication Algorithms, 2001, International Conference on Computational Science.
[18] Jack Dongarra, et al. PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing, 1995.
[19] Jaeyoung Choi, et al. OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing, 2018, HPC Asia Workshops.
[20] Satoshi Matsuoka, et al. High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures, 2018, ICPP Workshops.
[21] L. Dagum, et al. OpenMP: an industry standard API for shared-memory programming, 1998.
[22] Adrián Castelló, et al. Programming parallel dense matrix factorizations with look-ahead and OpenMP, 2018, Cluster Computing.
[23] Tao Tang, et al. LU factorization on heterogeneous systems: an energy-efficient approach towards high performance, 2016, Computing.
[24] Zhengji Zhao, et al. Performance of Hybrid MPI/OpenMP VASP on Cray XC40 Based on Intel Knights Landing Many Integrated Core Architecture, 2017.
[25] Jaeyoung Choi, et al. Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors, 2018, The Journal of Supercomputing.
[26] Jaeyoung Choi, et al. An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512, 2018, Cluster Computing.
[27] William B. Sawyer, et al. An MPI implementation of the BLACS, 1996, Proceedings of 3rd International Conference on High Performance Computing (HiPC).
[28] Moritz Diehl, et al. The BLAS API of BLASFEO, 2019, ACM Trans. Math. Softw.
[29] Jaeyoung Choi, et al. A new parallel matrix multiplication algorithm on distributed‐memory concurrent computers, 1998.
[30] R. V. D. Geijn, et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality, 2015, ACM Transactions on Mathematical Software.
[31] Ewing L. Lusk, et al. Early Experiments with the OpenMP/MPI Hybrid Programming Model, 2008, IWOMP.
[32] Xiaowen Chu, et al. Optimizing batched Winograd convolution on GPUs, 2020, PPoPP.
[33] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.
[34] Kostas Katrinis, et al. A taxonomy of task-based parallel programming technologies for high-performance computing, 2018, The Journal of Supercomputing.