Auto-tuning GEMM Kernels for a Decoupled Access/Execute Architecture Processor

A typical decoupled access/execute (DAE) processor consists of Access Processors (APs) and Execute Processors (EPs). The memory access overhead of the APs can be hidden behind the computation of the EPs. Based on this principle, this paper introduces a new optimization algorithm for general dense matrix multiplication (GEMM). The algorithm is divided into four levels, each of which uses a different storage structure of the processor, so that the algorithm is closely matched to the features of the DAE architecture. We also introduce DAEFS, a fetch performance evaluation system. It is a runtime system with very little overhead that collects information about the relationship between fetching and computation on DAE processors. With the help of DAEFS, several levels of the new algorithm are able to self-adjust, so the algorithm can find the best block parameters for those levels at runtime. We implement the algorithm with DAEFS on a DAE processor platform, the Godson-3B, which has been used to build the KD90 teraflop computer. The performance of the optimized GEMM on the real platform matches the best simulation results for this processor.
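
The idea of choosing block parameters at runtime can be illustrated with a minimal sketch. It is not the algorithm of this paper: it uses a single level of blocking rather than four, and it substitutes plain wall-clock timing for the fetch/computation information that DAEFS collects. The names blocked_gemm and tune_block, and the list of candidate block sizes, are hypothetical and chosen only for illustration.

```python
import time


def blocked_gemm(A, B, n, block):
    """Return C = A * B for n x n matrices stored as lists of rows,
    computed tile by tile with a square tile of side `block`."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, n, block):
            for j0 in range(0, n, block):
                # Multiply one tile of A by one tile of B and
                # accumulate into the matching tile of C.
                for i in range(i0, min(i0 + block, n)):
                    row_c = C[i]
                    row_a = A[i]
                    for k in range(k0, min(k0 + block, n)):
                        a = row_a[k]
                        row_b = B[k]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += a * row_b[j]
    return C


def tune_block(A, B, n, candidates):
    """Time one run per candidate block size and return the fastest.

    Assumption: wall-clock time stands in for a real fetch/compute
    profile; a single run per candidate is noisy and only a sketch.
    """
    best_block, best_time = None, float("inf")
    for block in candidates:
        start = time.perf_counter()
        blocked_gemm(A, B, n, block)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block
```

A caller would run tune_block once on representative inputs and then reuse the returned block size for later blocked_gemm calls. A tuner for a real DAE processor would presumably search one parameter per storage level and base the choice on measured fetch stalls rather than on elapsed time.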