GEMM Optimization for a Decoupled Access/Execute Architecture Processor
[1] Robert A. van de Geijn, et al. High-Performance Matrix Multiplication Algorithms for Architectures with Hierarchical Memories, 2001.
[2] Wang Qian, et al. OpenBLAS: a high performance BLAS library on Loongson 3A CPU, 2011.
[3] Jack J. Dongarra, et al. Autotuning GEMM Kernels for the Fermi GPU, 2012, IEEE Transactions on Parallel and Distributed Systems.
[4] Abdolah Chalechale, et al. Scheduling in Multiprocessor System Using Genetic Algorithm, 2012.
[5] Hu Weiwu. Optimization of matrix multiplication based on a multi-core architecture extended with vector units, 2011.
[6] Jack J. Dongarra, et al. The LINPACK Benchmark: past, present and future, 2003, Concurr. Comput. Pract. Exp.
[7] Robert A. van de Geijn, et al. A Family of High-Performance Matrix Multiplication Algorithms, 2001, International Conference on Computational Science.
[8] Robert A. van de Geijn, et al. A Family of High-Performance Matrix Multiplication Algorithms, 2004, PARA.
[9] Xu Yang, et al. Godson-3B: A 1GHz 40W 8-core 128GFLOPS processor in 65nm CMOS, 2011, 2011 IEEE International Solid-State Circuits Conference.
[10] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.
[11] Robert A. van de Geijn, et al. High-performance implementation of the level-3 BLAS, 2008, TOMS.