Experimental implementation of dynamic access ordering

As microprocessor speeds increase, memory bandwidth is rapidly becoming the performance bottleneck in the execution of vector-like algorithms. Although caching provides adequate performance for many problems, caching alone is insufficient for vector applications with poor temporal and spatial locality. Moreover, the nature of memories themselves has changed. Current DRAM components should not be treated as uniform access-time RAM: achieving greater bandwidth requires exploiting the characteristics of components at every level of the memory hierarchy. The authors describe hardware-assisted access ordering and a hardware development effort to build a Stream Memory Controller (SMC) that implements the technique for a commercially available high-performance microprocessor, the Intel i860. The strategy augments caching by combining compile-time detection of memory access patterns with a memory subsystem that decouples the order of requests generated by the processor from the order in which they are issued to the memory system. This decoupling permits requests to be issued in an order that optimizes use of the memory system.
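To illustrate why decoupling the two orders can help, the following is a minimal sketch under a toy model, not the SMC design itself. It assumes a single DRAM bank with one open page, where an access to the open page is cheap and any other access is expensive. The function names, page size, cycle costs, and buffer depth are all hypothetical values chosen for illustration; none come from the paper. The example compares the order a processor would naturally generate for the vector update y[i] = a*x[i] + y[i] with an order that groups accesses by stream.

```python
def access_cost(addresses, page_size=1024, hit_cost=1, miss_cost=5):
    """Total cost of an access sequence on one bank with a single open page.

    hit_cost and miss_cost are illustrative cycle counts (assumptions).
    """
    open_page = None
    total = 0
    for addr in addresses:
        page = addr // page_size
        if page == open_page:
            total += hit_cost       # access falls in the currently open page
        else:
            total += miss_cost      # a different page must be opened first
            open_page = page
    return total


def natural_order(x_base, y_base, n, word=8):
    """Order generated by the loop itself: read x[i], read y[i], write y[i]."""
    order = []
    for i in range(n):
        order.append(x_base + i * word)
        order.append(y_base + i * word)
        order.append(y_base + i * word)
    return order


def batched_order(x_base, y_base, n, word=8, depth=16):
    """Same accesses, grouped per stream in chunks of `depth` elements.

    `depth` stands in for a hypothetical per-stream buffer size.
    """
    order = []
    for start in range(0, n, depth):
        idx = range(start, min(start + depth, n))
        order.extend(x_base + i * word for i in idx)  # reads of x
        order.extend(y_base + i * word for i in idx)  # reads of y
        order.extend(y_base + i * word for i in idx)  # writes of y
    return order


if __name__ == "__main__":
    n = 256
    x_base, y_base = 0, 1 << 20   # two vectors placed in different pages
    natural = natural_order(x_base, y_base, n)
    batched = batched_order(x_base, y_base, n)
    print("natural order cost:", access_cost(natural))
    print("batched order cost:", access_cost(batched))
```

Both sequences contain exactly the same accesses; only their order differs. Under this toy model the grouped order pays the page-open cost once per chunk instead of on almost every switch between x and y, which is the kind of saving that reordering is meant to capture.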
