GEMM Optimization for a Decoupled Access/Execute Architecture Processor

General dense matrix multiplication (GEMM) has a great impact on the performance of supercomputers. The typical blocking algorithm for GEMM divides the operation into several levels. Because of limitations in data transmission, previous work on general-purpose processors has focused only on optimizing the lowest level of GEMM. The Decoupled Access/Execute (DAE) architecture processor, however, has an enhanced data-fetching capability. This paper introduces several methods for optimizing GEMM on a DAE processor. The Execute Processors (EP) of the platform are made up of Direct Register Access (DRA) and Direct Cache Access (DCA), which can be used to manage data transport between registers, cache, and memory. This paper pays more attention to optimizing the higher levels of the blocked GEMM. The GEMM kernel for the DAE processor is divided into four levels, and several levels of the new algorithm are capable of self-adjusting. As a result, the performance of our algorithm is effectively improved.
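For readers unfamiliar with the blocking approach mentioned above, the following is a minimal sketch of a generic blocked GEMM in Python. It is not the four-level algorithm of this paper and does not model DRA or DCA. The function name `blocked_gemm` and the block sizes `mc`, `kc`, and `nc` are hypothetical, chosen only to illustrate how the loop nest separates into outer block levels and a lowest-level block multiply.

```python
def blocked_gemm(A, B, mc=4, kc=4, nc=4):
    """Compute C = A * B with a blocked loop nest (illustrative sketch only).

    A is an m x k matrix and B is a k x n matrix, both given as lists of rows.
    mc, kc and nc are hypothetical block sizes, not values from the paper.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]

    # Outer levels: walk over blocks of columns of B, the shared
    # dimension k, and rows of A.
    for jc in range(0, n, nc):
        for pc in range(0, k, kc):
            for ic in range(0, m, mc):
                # Lowest level: multiply one (mc x kc) block of A by one
                # (kc x nc) block of B and accumulate into a block of C.
                for i in range(ic, min(ic + mc, m)):
                    for p in range(pc, min(pc + kc, k)):
                        a_ip = A[i][p]
                        for j in range(jc, min(jc + nc, n)):
                            C[i][j] += a_ip * B[p][j]
    return C
```

In this sketch the block sizes are fixed arguments. A level that self-adjusts would presumably choose such sizes from the problem dimensions and the available storage, which is outside the scope of this illustration.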