Increasing Memory Bandwidth for Vector Computations

Memory bandwidth is rapidly becoming the performance bottleneck when high-performance microprocessors are applied to vector-like algorithms, including the "Grand Challenge" scientific problems. Caching alone does not solve the problem for these applications, because their data accesses have poor temporal and spatial locality. Moreover, the nature of memories themselves has changed. Achieving greater bandwidth requires exploiting the characteristics of the memory components "on the other side of the cache": they should not be treated as uniform access-time RAM. This paper describes hardware-assisted access ordering, a technique that combines compile-time detection of memory access patterns with a memory subsystem that decouples the order in which the processor generates requests from the order in which they are issued to the memory system. This decoupling permits requests to be issued in an order that optimizes use of the memory system. Our simulations show significant speedup on important scientific kernels.
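The effect of reordering can be illustrated with a toy model. The sketch below is not the simulator used in this work. It assumes a single-bank page-mode DRAM in which an access to the currently open page is cheaper than one that opens a new page. The page size, the cycle costs, the daxpy-style loop, and all function names are hypothetical choices made only for illustration.

```python
PAGE_WORDS = 512   # assumed DRAM page size, in words
HIT_COST = 1       # assumed cycles for an access to the open page
MISS_COST = 4      # assumed cycles when a new page must be opened


def access_cost(addresses):
    """Total cycles for a single-bank page-mode DRAM serving addresses in order."""
    open_page = None
    cycles = 0
    for addr in addresses:
        page = addr // PAGE_WORDS
        cycles += HIT_COST if page == open_page else MISS_COST
        open_page = page
    return cycles


def natural_order(x_base, y_base, n):
    """Order in which a processor would issue y[i] = a*x[i] + y[i]."""
    for i in range(n):
        yield x_base + i   # read x[i]
        yield y_base + i   # read y[i]
        yield y_base + i   # write y[i]


def reordered(x_base, y_base, n, depth):
    """The same accesses, grouped per stream in chunks of `depth` elements."""
    for start in range(0, n, depth):
        stop = min(start + depth, n)
        for i in range(start, stop):
            yield x_base + i   # burst of x reads
        for i in range(start, stop):
            yield y_base + i   # burst of y reads
        for i in range(start, stop):
            yield y_base + i   # burst of y writes


if __name__ == "__main__":
    n, x_base, y_base = 1024, 0, 4096
    print("natural  :", access_cost(natural_order(x_base, y_base, n)))
    print("reordered:", access_cost(reordered(x_base, y_base, n, depth=32)))
```

Under these assumed parameters the interleaved order switches pages on almost every access, while the grouped order pays the page-opening cost only once per burst. The chunk size `depth` stands in for the buffering a decoupled memory subsystem would need to hold the reordered requests.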
