Paper Information - GEMM Optimization for a Decoupled Access/Execute Architecture Processor

GEMM Optimization for a Decoupled Access/Execute Architecture Processor

General dense matrix multiplication (GEMM) has a great impact on the performance of supercomputers. The typical blocking algorithm for GEMM divides the operation into several levels. Because of limits on data transfer, previous work on general-purpose processors has focused only on optimizing the lowest level of GEMM. However, the Decoupled Access/Execute (DAE) architecture processor has an enhanced ability to fetch data. This paper introduces several methods for optimizing GEMM on a DAE processor. The Execute Processor (EP) of the platform is made up of Direct Register Access (DRA) and Direct Cache Access (DCA), which can be used to manage data transfer between registers, cache, and memory. This paper focuses on optimizing the higher levels of the blocked GEMM. The GEMM kernel for the DAE processor is divided into 4 levels, and several levels of the new algorithm are capable of self-adjusting. As a result, the performance of our algorithm was effectively improved.
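To illustrate what a multi-level blocking algorithm for GEMM looks like, here is a minimal sketch in plain Python. It is not the paper's kernel: the function name `blocked_gemm`, the block sizes `mc`, `kc`, `nc`, and the loop order are all assumptions chosen for illustration, and the sketch does not model DRA, DCA, or any self-adjusting level.

```python
def blocked_gemm(A, B, C, mc=2, kc=2, nc=2):
    """Compute C += A * B on lists of lists using nested blocking.

    mc, kc, nc are hypothetical block sizes; a real kernel would pick
    them to match register, cache, and memory capacities.
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):          # outer level: column panels of B and C
        for pc in range(0, k, kc):      # next level: slices of the shared dimension
            for ic in range(0, m, mc):  # next level: row blocks of A and C
                # lowest level: multiply one block of A by one block of B
                for i in range(ic, min(ic + mc, m)):
                    for p in range(pc, min(pc + kc, k)):
                        a = A[i][p]
                        for j in range(jc, min(jc + nc, n)):
                            C[i][j] += a * B[p][j]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = [[0, 0], [0, 0]]
print(blocked_gemm(A, B, C))
```

The three outer loops correspond to the higher levels that the abstract says earlier work left unoptimized, while the innermost triple loop stands in for the lowest-level kernel. Changing the block sizes changes only the traversal order, not the result.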

Naijie Gu | Yangzhao Yang | Zeng Zhao
