Bounding on the gain of optimizing data layout in vector processors

In vector processors, the number of memory banks (m) is generally larger than or equal to the memory access time divided by the processor cycle time. This ratio is denoted t, i.e. m ≥ t. Data is moved between the vector registers and memory using long sequences of memory accesses whose addresses are separated by a fixed distance called the stride. For some strides, performance is seriously degraded by memory bank conflicts. Many scientific applications are based on large matrices, and for such programs it is well known that the most unfavorable strides can be avoided by adding a number of dummy columns or by using hardware skewing. We present an optimal upper bound on the number of access conflicts when optimizing the data layout in this way. Programs are categorized according to their strides, and the worst-case behavior for each category is given in a theorem. The result shows that for worst-case scenarios the number of conflicts increases rapidly as t grows; e.g., to keep the worst-case behavior relatively constant when t grows from 6 to 10, the number of memory banks must at least double. The result is valid for skewed as well as non-skewed memory systems.
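
As a rough illustration of why some strides are unfavorable, the following minimal sketch counts the cycles lost to busy banks during one strided access sequence. It is not the paper's model or its bound: it assumes simple non-skewed interleaving (bank = address mod m), one access issued per cycle, and a bank that stays busy for t cycles after each access. The function name `stall_cycles`, the sequence length `n`, and the example values m = 8, t = 4 are hypothetical choices made only for this example.

```python
def stall_cycles(stride, m, t, n=64):
    """Count cycles spent waiting for busy banks over n strided accesses.

    Assumptions (illustrative only): bank = address mod m, one access is
    issued per cycle, and a bank is busy for t cycles after each access.
    """
    free_at = [0] * m  # cycle at which each bank can accept a new access
    cycle = 0
    stalls = 0
    for i in range(n):
        bank = (i * stride) % m
        if free_at[bank] > cycle:
            # Bank conflict: wait until the bank becomes free again.
            stalls += free_at[bank] - cycle
            cycle = free_at[bank]
        free_at[bank] = cycle + t
        cycle += 1
    return stalls


m, t = 8, 4  # hypothetical configuration with m >= t

# Walking down a column of a row-major matrix with 8 columns gives stride 8:
# every access hits the same bank.
unpadded = stall_cycles(8, m, t)

# Adding one dummy column changes the stride to 9, which spreads the
# accesses over all 8 banks.
padded = stall_cycles(9, m, t)

print(unpadded, padded)
```

Under these assumptions, the stride-8 sequence stalls on every access after the first, while the padded stride-9 sequence behaves like unit stride and never waits for a bank.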
