Paper information - Auto-tuning GEMM Kernels for a Decoupled Access/Execute Architecture Processor

Auto-tuning GEMM Kernels for a Decoupled Access/Execute Architecture Processor

A typical decoupled access/execute (DAE) architecture processor consists of Access Processors (AP) and Execute Processors (EP). The memory access overhead of the AP can be hidden behind the computation of the EP. Based on this principle, this paper introduces a new optimization algorithm for general dense matrix multiplication (GEMM). The algorithm is divided into four levels, each of which uses a different storage structure of the processor, so that the algorithm is closely matched to the features of the DAE architecture. The paper also introduces DAEFS, a fetch performance evaluation system. DAEFS is a runtime system with very little overhead that collects information about the relationship between fetching and computation on DAE processors. With the help of DAEFS, several levels of the new algorithm are able to self-adjust, so the algorithm can find the best block parameters for those levels at runtime. We implement the algorithm with DAEFS on a DAE processor platform, the Godson-3B, which has been used to build the teraflop computer KD90. The performance of the optimized GEMM on the real platform matches the best simulation results for this processor.
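As a rough illustration of what tuning block parameters at runtime can look like, the sketch below tiles a square matrix multiplication with a single block size and picks the fastest size from a list of candidates by timing each one. It is a minimal sketch under stated assumptions, not the paper's four-level algorithm and not DAEFS: the function names `blocked_gemm` and `pick_block_size`, the single tiling level, and the wall-clock timing heuristic are all hypothetical choices made for this example.

```python
import time


def blocked_gemm(A, B, n, block):
    """Return C = A * B for n x n matrices stored as lists of rows.

    The three outer loops walk over tiles of side `block`; the three
    inner loops multiply one tile of A by one tile of B.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        i_end = min(ii + block, n)
        for kk in range(0, n, block):
            k_end = min(kk + block, n)
            for jj in range(0, n, block):
                j_end = min(jj + block, n)
                for i in range(ii, i_end):
                    row_a, row_c = A[i], C[i]
                    for k in range(kk, k_end):
                        a, row_b = row_a[k], B[k]
                        for j in range(jj, j_end):
                            row_c[j] += a * row_b[j]
    return C


def pick_block_size(A, B, n, candidates):
    """Hypothetical tuner: time each candidate once, return the fastest."""
    best_block, best_time = None, float("inf")
    for block in candidates:
        start = time.perf_counter()
        blocked_gemm(A, B, n, block)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block


if __name__ == "__main__":
    n = 64
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    print("selected block size:", pick_block_size(A, B, n, [4, 8, 16, 32]))
```

A real tuner for a DAE processor would presumably rely on measurements of how well fetching overlaps with computation rather than on wall-clock time alone, and would tune a separate parameter for each level of the memory hierarchy.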

Naijie Gu | Yangzhao Yang | Zeng Zhao

[1] James E. Smith, et al. Decoupled access/execute computer architectures, 1984, TOCS.

[2] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS, 2008, TOMS.

[3] Joan-Manuel Parcerisa, et al. The latency hiding effectiveness of decoupled access/execute processors, 1998, Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204).

[4] Jack J. Dongarra, et al. Autotuning GEMM Kernels for the Fermi GPU, 2012, IEEE Transactions on Parallel and Distributed Systems.

[5] Hu Weiwu. Optimization of matrix multiplication based on a multi-core architecture extended with vector units, 2011.

[6] Jack J. Dongarra, et al. The LINPACK Benchmark: past, present and future, 2003, Concurr. Comput. Pract. Exp.

[7] Robert A. van de Geijn, et al. A Family of High-Performance Matrix Multiplication Algorithms, 2001, International Conference on Computational Science.

[8] Lukasz Szustak, et al. Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture, 2012, Parallel Comput.

[9] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.

[10] Xu Yang, et al. Godson-3B: A 1GHz 40W 8-core 128GFLOPS processor in 65nm CMOS, 2011, 2011 IEEE International Solid-State Circuits Conference.