Dynamic Access Ordering for Streamed Computations

Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Because most modern DRAM components support modes in which some access sequences complete faster than others, this predictability makes it possible to reorder stream accesses for better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does not increase bandwidth requirements. The SMC is practical to implement: it uses existing compiler technology and requires only a modest amount of special-purpose hardware. We present simulation results for fast-page mode and Rambus DRAM memory systems, and we describe a prototype system with which we have observed performance improvements for inner loops by factors of 13 over traditional access methods.
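To make the reordering idea concrete, the following is a minimal sketch of a toy cost model, not the SMC's actual scheduling policy. It assumes a single DRAM bank that keeps one page open, with hypothetical values for the page size (`PAGE_WORDS`) and for the cycle costs of a page hit and a page miss (`HIT_CYCLES`, `MISS_CYCLES`). The function names and the `fifo_depth` parameter are likewise illustrative assumptions. It compares the order in which a loop would naturally touch several unit-stride streams with an order that services each stream in bursts, as a buffered controller might.

```python
# Toy model: all constants below are assumptions chosen for illustration.
PAGE_WORDS = 512   # assumed DRAM page size, in words
HIT_CYCLES = 1     # assumed cost of an access to the currently open page
MISS_CYCLES = 4    # assumed cost of an access that must open a new page


def access_cost(addresses):
    """Total cycles for a single-bank DRAM that keeps one page open."""
    open_page = None
    cycles = 0
    for addr in addresses:
        page = addr // PAGE_WORDS
        cycles += HIT_CYCLES if page == open_page else MISS_CYCLES
        open_page = page
    return cycles


def natural_order(bases, n):
    """Loop order: one element from each stream per iteration."""
    return [base + i for i in range(n) for base in bases]


def burst_order(bases, n, fifo_depth):
    """Reordered: service up to fifo_depth consecutive elements of one
    stream before switching to the next stream."""
    order = []
    for start in range(0, n, fifo_depth):
        stop = min(start + fifo_depth, n)
        for base in bases:
            order.extend(base + i for i in range(start, stop))
    return order


if __name__ == "__main__":
    bases = [0, 4096, 8192]   # three streams placed in different pages
    n = 1024                  # elements per stream
    natural = natural_order(bases, n)
    burst = burst_order(bases, n, fifo_depth=32)
    print("natural order cycles:", access_cost(natural))
    print("burst order cycles:  ", access_cost(burst))
```

Both orders touch exactly the same addresses, so the total data moved is unchanged; only the number of page misses differs. In this model, deeper buffers amortize each page miss over more hits, which is the intuition behind reordering without raising bandwidth requirements.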
