A Performance Model of Dense Matrix Operations on Many-Core Architectures

Current many-core architectures (MCA) have much larger arithmetic to memory bandwidth ratio compared with traditional processors (vector, superscalar, and multi-core, etc). As a result, bandwidth has become an important performance bottleneck of MCA. Previous works have demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU and Cholesky decomposition. The input parameters are memory bandwidth Band on-chip SRAM capacity C, while the output is maximum core number P max . We show that $P_{max}=\Theta(B\ast \sqrt{C})$. P max indicates that when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores P max . The model is validated by a comparison between the theoretical performance and experimental data of previous works.