Reducing the Traffic of Loop-Based Programs Using a Prefetch Processor

Large cache blocks exploit spatial locality and amortize long memory latency over more words. Their cost, however, is higher memory traffic, especially for applications with poor spatial locality. Software prefetching is also usually presumed to increase memory traffic. We present an architecture that devotes a separate processor to prefetching, which improves execution time while allowing the cache block size to be reduced, thereby reducing memory traffic. Simulation results for eight scientific and signal processing benchmarks show that our architecture reduces traffic at the microprocessor chip boundary by between 15% and 67% while reducing execution time by up to 68%.
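
The trade-off between block size and traffic can be illustrated with a toy model. The sketch below is not the simulator used for the results above. It assumes a cache large enough that each distinct block is fetched exactly once (cold misses only), 8-byte data words, and a single loop that walks memory at a fixed stride. The function name `traffic_bytes` and all parameter values are hypothetical.

```python
def traffic_bytes(num_accesses, stride_bytes, block_bytes):
    """Bytes fetched from memory by a strided loop.

    Assumption: every distinct cache block touched by the loop is
    fetched exactly once (cold misses only, no evictions).
    """
    # Set of block indices touched by accesses at addresses 0, s, 2s, ...
    blocks = {(i * stride_bytes) // block_bytes for i in range(num_accesses)}
    return len(blocks) * block_bytes


WORD_BYTES = 8        # assumed size of one data word
NUM_ACCESSES = 1024   # assumed loop trip count

for stride in (8, 64):            # unit stride vs. a stride of 8 words
    for block in (8, 32, 128):    # candidate cache block sizes in bytes
        fetched = traffic_bytes(NUM_ACCESSES, stride, block)
        useful = NUM_ACCESSES * WORD_BYTES
        print(f"stride={stride:3d} block={block:3d} "
              f"fetched={fetched:6d} useful={useful}")
```

Under these assumptions a unit-stride loop fetches the same number of bytes at every block size, while the loop with a 64-byte stride fetches more bytes as the block grows, because most of each block is never used. This is the sense in which a smaller block size can reduce traffic for code with poor spatial locality, provided the latency it no longer amortizes is hidden some other way, such as by prefetching.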
