Auto-tuning GEMM Kernels for a Decoupled Access/Execute Architecture Processor
[1] James E. Smith, et al. Decoupled access/execute computer architectures, 1984, TOCS.
[2] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS, 2008, TOMS.
[3] Joan-Manuel Parcerisa, et al. The latency hiding effectiveness of decoupled access/execute processors, 1998, Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204).
[4] Jack J. Dongarra, et al. Autotuning GEMM Kernels for the Fermi GPU, 2012, IEEE Transactions on Parallel and Distributed Systems.
[5] Hu Weiwu. Optimization of matrix multiplication based on a multi-core architecture extended with vector units, 2011.
[6] Jack J. Dongarra, et al. The LINPACK Benchmark: past, present and future, 2003, Concurr. Comput. Pract. Exp.
[7] Robert A. van de Geijn, et al. A Family of High-Performance Matrix Multiplication Algorithms, 2001, International Conference on Computational Science.
[8] Lukasz Szustak, et al. Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture, 2012, Parallel Comput.
[9] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.
[10] Xu Yang, et al. Godson-3B: A 1GHz 40W 8-core 128GFLOPS processor in 65nm CMOS, 2011, 2011 IEEE International Solid-State Circuits Conference.