Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable Processors

In high-performance computing, the general matrix-matrix multiplication (xGEMM) routine is the core Level 3 BLAS kernel for efficient matrix-matrix multiplication. The performance of parallel xGEMM (PxGEMM) is largely determined by two factors: the floating-point rate achieved by the local block computations and the communication cost of broadcasting submatrices among processes. In this study, an approach is proposed to improve and adjust the parallel double-precision general matrix-matrix multiplication (PDGEMM) routine for modern Intel processors such as Knights Landing (KNL) and Xeon Scalable Processors (SKL). The proposed approach consists of two methods that address these factors. First, the computational part of PDGEMM is improved with a blocked GEMM algorithm whose block sizes are chosen to better fit the KNL and SKL architectures. Second, the communication routine is adjusted to use the Message Passing Interface (MPI) directly, overcoming the default settings of the Basic Linear Algebra Communication Subprograms (BLACS) and improving time efficiency. Consequently, performance improvements are demonstrated for smaller matrix multiplications on the SKL clusters.
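
The computational part described above relies on AVX-512 vector FMA operations inside a cache-blocked loop nest. The following C fragment is a minimal sketch of that idea, not the authors' actual kernel; the block sizes MB, NB, KB and the function name dgemm_blocked_avx512 are illustrative assumptions, and the tuned values for KNL and SKL would differ.

    #include <immintrin.h>

    /* Illustrative block sizes (assumed, not taken from the paper);
     * values tuned for the KNL/SKL cache hierarchies would differ. */
    #define MB 64
    #define NB 64
    #define KB 256

    /* Sketch of a blocked DGEMM update C += A * B for square row-major
     * matrices whose order n is a multiple of the block sizes.  The inner
     * loop processes 8 doubles per iteration with AVX-512 FMA instructions. */
    static void dgemm_blocked_avx512(int n, const double *A,
                                     const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += MB)
            for (int kk = 0; kk < n; kk += KB)
                for (int jj = 0; jj < n; jj += NB)
                    for (int i = ii; i < ii + MB; ++i)
                        for (int k = kk; k < kk + KB; ++k) {
                            /* Broadcast A(i,k) into all 8 vector lanes. */
                            __m512d a = _mm512_set1_pd(A[(size_t)i * n + k]);
                            for (int j = jj; j < jj + NB; j += 8) {
                                __m512d b = _mm512_loadu_pd(&B[(size_t)k * n + j]);
                                __m512d c = _mm512_loadu_pd(&C[(size_t)i * n + j]);
                                /* C(i, j:j+7) += A(i,k) * B(k, j:j+7) */
                                c = _mm512_fmadd_pd(a, b, c);
                                _mm512_storeu_pd(&C[(size_t)i * n + j], c);
                            }
                        }
    }

For the communication part, the adjustment replaces BLACS-managed broadcasts with direct MPI calls. The fragment below only illustrates the general pattern of broadcasting a local panel along a process row with MPI_Bcast on a split communicator; the function and variable names are hypothetical and do not reproduce the paper's routine.

    #include <mpi.h>

    /* Hypothetical sketch: broadcast the local panel of A along one row of
     * the process grid using a communicator split from MPI_COMM_WORLD,
     * instead of relying on the BLACS broadcast topology settings. */
    void bcast_a_panel(double *a_panel, int panel_len,
                       int my_row, int my_col, int root_col)
    {
        MPI_Comm row_comm;

        /* Processes in the same grid row share a communicator; their ranks
         * inside it follow the grid-column order, so root_col identifies
         * the owner of the current panel. */
        MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);
        MPI_Bcast(a_panel, panel_len, MPI_DOUBLE, root_col, row_comm);
        MPI_Comm_free(&row_comm);
    }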
