Exploiting Application-Level Information to Reduce Memory Bandwidth Consumption

As processors continue to deliver higher levels of performance and as memory latency tolerance techniques become widespread to address the increasing cost of accessing memory, memory bandwidth will emerge as a major performance bottleneck. Rather than rely solely on wider and faster memories to address memory bandwidth shortages, an alternative is to use existing memory bandwidth more efficiently. A promising approach is hardware-based selective subblocking [12, 1]. In this technique, hardware predictors track the portions of cache blocks that are referenced by the processor. On a cache miss, the predictors are consulted and only previously referenced portions are fetched into the cache, thus conserving memory bandwidth. This paper proposes a software-centric (and hence more complexity-effective) approach to selective subblocking. We make the key observation that wasteful data fetching inside long cache blocks arises from certain sparse memory references, and that such memory references can be identified in the application source code. Rather than use hardware predictors to discover sparse memory reference patterns from the dynamic memory reference stream, our approach relies on the programmer or compiler to identify the sparse memory references statically, and to use special annotated memory instructions to specify the amount of spatial reuse associated with such memory references. At runtime, the size annotations select the amount of data to fetch on each cache miss, so that only data likely to be accessed by the processor is fetched. Our results show that annotated memory instructions remove between 54% and 71% of cache traffic for 7 applications, reducing more traffic than hardware selective subblocking using a 32 Kbyte predictor on all applications, and reducing as much traffic as hardware selective subblocking using an 8 Mbyte predictor on 5 out of 7 applications.
Overall, annotated memory instructions achieve a 17% performance gain when used alone, and a 22.3% performance gain when combined with software prefetching, compared to a 7.2% performance degradation when prefetching without annotated memory instructions.
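
To make the mechanism concrete, the following is a minimal sketch of how a per-access size annotation could reduce fetch traffic in a sectored cache. It is an illustrative toy model, not the simulator or instruction set used in this work: the 64-byte block, the 8-byte sector, the direct-mapped organization, and the names `BLOCK`, `SECTOR`, and `traffic` are all assumptions made for the example.

```python
BLOCK = 64   # cache block size in bytes (assumed for illustration)
SECTOR = 8   # smallest fetchable unit in bytes (assumed for illustration)


def traffic(accesses, num_sets=256):
    """Return the bytes fetched from memory by a toy direct-mapped sectored cache.

    `accesses` is a list of (address, fetch_size) pairs. `fetch_size` plays the
    role of the size annotation: on a miss, only the aligned `fetch_size`-byte
    region containing the address is fetched. It must be a power of two
    between SECTOR and BLOCK.
    """
    tags = [None] * num_sets                  # block currently held by each set
    valid = [set() for _ in range(num_sets)]  # valid sectors within that block
    fetched = 0
    for addr, size in accesses:
        block = addr // BLOCK
        idx = block % num_sets
        if tags[idx] != block:                # block miss: evict and start empty
            tags[idx] = block
            valid[idx] = set()
        offset = addr % BLOCK
        if offset // SECTOR in valid[idx]:    # sector already present: hit
            continue
        first = ((offset // size) * size) // SECTOR
        for s in range(first, first + size // SECTOR):
            if s not in valid[idx]:           # fetch only the missing sectors
                valid[idx].add(s)
                fetched += SECTOR
    return fetched


# A sparse pattern: one 8-byte reference in each of 100 consecutive blocks.
full_block = [(i * BLOCK, BLOCK) for i in range(100)]   # conventional fetch
annotated = [(i * BLOCK, SECTOR) for i in range(100)]   # 8-byte annotation

print(traffic(full_block))  # 6400 bytes
print(traffic(annotated))   # 800 bytes
```

In this toy pattern every reference touches a single sector per block, so the annotation avoids fetching the seven unused sectors. For a dense pattern that touches every sector of each block, both variants fetch the same number of bytes, which is why annotations would be applied only to references identified as sparse.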

[1]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[2]  Todd C. Mowry,et al.  Tolerating latency in multiprocessors through compiler-inserted prefetching , 1998, TOCS.

[3]  Anne Rogers,et al.  Supporting dynamic data structures on distributed-memory machines , 1995, TOPL.

[4]  Larry Rudolph,et al.  Creating a wider bus using caching techniques , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[5]  Sharon E. Perl,et al.  Studies of Windows NT performance using dynamic execution traces , 1996, OSDI '96.

[6]  Sanjeev Kumar,et al.  Exploiting spatial locality in data caches using spatial footprints , 1998, ISCA.

[7]  Ken Kennedy,et al.  The memory bandwidth bottleneck and its amelioration by a compiler , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[8]  Janak H. Patel,et al.  Data prefetching in multiprocessor vector cache memories , 1991, ISCA '91.

[9]  James R. Larus,et al.  Cache-conscious structure definition , 1999, PLDI '99.

[10]  Reinhard von Hanxleden.  Handling Irregular Problems with Fortran D: A Preliminary Report , 1993, D Newsletter #9.

[11]  Laszlo A. Belady,et al.  A Study of Replacement Algorithms for a Virtual-Storage Computer , 1966, IBM Syst. J.

[12]  Todd C. Mowry,et al.  Tolerating latency through software-controlled data prefetching , 1994 .

[13]  J.W.C. Fu,et al.  Data prefetching in multiprocessor vector cache memories , 1991, [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture.

[14]  Erik Brunvand,et al.  Impulse: building a smarter memory controller , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[15]  Mateo Valero,et al.  A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality , 1995, International Conference on Supercomputing.

[16]  James R. Larus,et al.  Cache-conscious structure layout , 1999, PLDI '99.

[17]  Richard E. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[18]  D. Burger,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[19]  Anne Rogers,et al.  Software support for speculative loads , 1992, ASPLOS V.

[20]  Per Stenström,et al.  A prefetching technique for irregular accesses to linked data structures , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[21]  Olivier Temam,et al.  Software assistance for data caches , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[22]  Alan Jay Smith,et al.  Cache Memories , 1982, CSUR.

[23]  Kazuaki Murakami,et al.  Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[24]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[25]  James R. Goodman,et al.  Hardware techniques to improve the performance of the processor/memory interface , 1998 .

[26]  Eduardo Sanchez,et al.  A Study of a Simultaneous Multithreaded Processor Implementation , 1999, Euro-Par.

[27]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[28]  Wen-mei W. Hwu,et al.  Run-time spatial locality detection and optimization , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.