A memory controller for improved performance of streamed computations on symmetric multiprocessors

The growing disparity between processor and memory speeds has made memory bandwidth the performance bottleneck for many applications. In particular, this performance gap severely impacts stream-oriented computations such as (de)compression, encryption, and scientific vector processing. This paper describes the development of an intelligent memory interface that exploits compiler-provided information on streamed memory access patterns to improve memory bandwidth. Simulation results show that shared-memory multiprocessor systems equipped with such an interface can deliver nearly the full attainable bandwidth at relatively modest hardware cost.
