Accelerating tile low-rank GEMM on Sunway architecture: POSTER

Tile Low-Rank (TLR) GEMM can significantly reduce the computation and memory footprint of matrix multiplication while preserving the same level of accuracy [1]. TLR-GEMM is based on the TLR data format, an efficient method for storing large-scale sparse matrices. A large matrix is divided into blocks, known as tiles, and each off-diagonal tile is compressed into the product of two tall-and-skinny matrices (the low-rank data format). TLR-GEMM multiplies two TLR matrices A and B to obtain matrix C. It can be implemented in batch mode: multiple threads are started, and each thread applies the required operations, including dense GEMM, SVD, and QR decomposition, to its corresponding tiles. One research challenge for TLR-GEMM is that modern high-performance processors use diverse architectures, so an implementation must be adapted to the unique features of each architecture to achieve good performance.
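
To make the low-rank tile format concrete, the following is a minimal NumPy sketch, not the poster's Sunway implementation. It assumes a tile is compressed by truncated SVD into a pair (U, V) with tile ≈ U Vᵀ, and that two compressed tiles are multiplied by forming only a small core matrix. The function names `compress_tile` and `lowrank_multiply` and the tolerance `tol` are illustrative assumptions; the recompression step (QR followed by SVD) that a full TLR-GEMM needs when accumulating into C is omitted.

```python
import numpy as np

def compress_tile(tile, tol=1e-8):
    """Truncated SVD of a tile: return (U, V) with tile ~= U @ V.T."""
    u, s, vt = np.linalg.svd(tile, full_matrices=False)
    # Keep singular values above tol relative to the largest one (at least one).
    k = max(1, int(np.sum(s > tol * s[0])))
    # Fold the singular values into U so the tile is stored as two skinny factors.
    return u[:, :k] * s[:k], vt[:k, :].T

def lowrank_multiply(a, b):
    """Multiply two low-rank tiles given as (U, V) pairs without densifying them."""
    ua, va = a
    ub, vb = b
    core = va.T @ ub      # small k_a x k_b matrix
    return ua @ core, vb  # product ~= (ua @ core) @ vb.T

# Hypothetical test data: two 64 x 64 tiles of exact rank 4.
rng = np.random.default_rng(0)
n, r = 64, 4
tile_a = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
tile_b = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

a_lr = compress_tile(tile_a)
b_lr = compress_tile(tile_b)
u_c, v_c = lowrank_multiply(a_lr, b_lr)

# Relative error of the low-rank product against the dense product.
dense = tile_a @ tile_b
err = np.linalg.norm(u_c @ v_c.T - dense) / np.linalg.norm(dense)
```

In this sketch the product of two n × n tiles of rank k costs on the order of n k² operations instead of n³, which is the source of the savings described above. In a batched setting, each thread would run these steps on its own tiles.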

[1] Wei Ge, et al. The Sunway TaihuLight supercomputer: system and applications, 2016, Science China Information Sciences.

[2] David E. Keyes, et al. Tile Low-Rank GEMM Using Batched Operations on GPUs, 2018, Euro-Par.