Paper Information - A Performance Model of Dense Matrix Operations on Many-Core Architectures


Current many-core architectures (MCA) have a much larger ratio of arithmetic throughput to memory bandwidth than traditional processors (vector, superscalar, multi-core, etc.). As a result, memory bandwidth has become an important performance bottleneck of MCA. Previous works have demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU decomposition, and Cholesky decomposition. The input parameters are the memory bandwidth $B$ and the on-chip SRAM capacity $C$, while the output is the maximum core number $P_{max}$. We show that $P_{max} = \Theta(B \sqrt{C})$. $P_{max}$ indicates that, when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores does not exceed $P_{max}$. The model is validated by comparing the theoretical performance with experimental data from previous works.
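The scaling law in the abstract can be explored numerically with a minimal sketch. The abstract gives only the asymptotic form $P_{max} = \Theta(B \sqrt{C})$, so the function name `p_max_estimate` and the constant `k` below are hypothetical: `k` stands in for whatever factors the paper's model hides inside $\Theta$ (per-core arithmetic rate, word size, algorithm constants) and is not a value taken from the paper. The units of `bandwidth` and `sram_capacity` are likewise assumptions.

```python
import math


def p_max_estimate(bandwidth, sram_capacity, k=1.0):
    """Estimate the bandwidth-limited core count as k * B * sqrt(C).

    bandwidth     -- memory bandwidth B (assumed unit: words per cycle)
    sram_capacity -- on-chip SRAM capacity C (assumed unit: words)
    k             -- hypothetical constant hidden by the Theta notation
    """
    if bandwidth <= 0 or sram_capacity <= 0 or k <= 0:
        raise ValueError("bandwidth, sram_capacity and k must be positive")
    return k * bandwidth * math.sqrt(sram_capacity)


if __name__ == "__main__":
    base = p_max_estimate(bandwidth=4.0, sram_capacity=2**20)
    more_bandwidth = p_max_estimate(bandwidth=8.0, sram_capacity=2**20)
    more_sram = p_max_estimate(bandwidth=4.0, sram_capacity=4 * 2**20)

    # Doubling B doubles the estimate; quadrupling C is needed for the same gain.
    print("baseline estimate:", base)
    print("ratio after doubling B:", more_bandwidth / base)
    print("ratio after quadrupling C:", more_sram / base)
```

Only the ratios are meaningful here, since `k` is unknown: the estimate grows linearly with bandwidth but only with the square root of SRAM capacity, which is why adding on-chip memory is a weaker lever than adding bandwidth under this model.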
