Reducing cache misses using hardware and software page placement

As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks. In this paper we examine the performance of compiler and hardware approaches for reordering pages in physically addressed caches to eliminate cache misses. The software approach provides a color mapping at compile time for code and data pages, which the operating system can then use to guide its allocation of physical pages. The hardware approach adds a page remap field to the TLB, which allows a page to be remapped to a different color in the physically indexed cache while keeping the same physical page in memory. The results show that, on average, software page placement provided a 28% speedup and hardware page placement provided a 21% speedup for a superscalar processor. For a four-processor single-chip multiprocessor, the average miss rate was reduced from 8.7% to 7.2%.
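The two mechanisms described above can be illustrated with a minimal sketch. It is not the paper's implementation: the cache geometry (a 64 KB direct-mapped cache, 4 KB pages, 32-byte lines) and the names `page_color`, `cache_set`, and `os_allocate` are assumptions chosen for illustration. The sketch assumes the usual definition of a page's color as the physical page number modulo the number of page-sized regions in one cache way.

```python
# Hypothetical cache geometry, chosen only for illustration.
PAGE_SIZE = 4096          # bytes per page
CACHE_SIZE = 64 * 1024    # bytes in the physically indexed cache
ASSOC = 1                 # direct-mapped
LINE_SIZE = 32            # bytes per cache line

# Number of distinct page colors: page-sized regions in one cache way.
NUM_COLORS = CACHE_SIZE // (ASSOC * PAGE_SIZE)


def page_color(ppn):
    """Color of a physical page: the low bits of its physical page number."""
    return ppn % NUM_COLORS


def cache_set(paddr, remap_color=None):
    """Cache set selected by a physical address.

    If remap_color is given, it stands in for a TLB remap field: the color
    bits of the index are replaced, while the physical page is unchanged.
    """
    ppn, offset = divmod(paddr, PAGE_SIZE)
    color = page_color(ppn) if remap_color is None else remap_color
    return (color * PAGE_SIZE + offset) // LINE_SIZE


def os_allocate(free_pages, preferred_color):
    """Software approach: pick a free physical page of the preferred color,
    falling back to the first free page if none matches."""
    for ppn in free_pages:
        if page_color(ppn) == preferred_color:
            free_pages.remove(ppn)
            return ppn
    return free_pages.pop(0)


# Physical pages 3 and 19 share a color, so their lines conflict.
assert cache_set(3 * PAGE_SIZE) == cache_set(19 * PAGE_SIZE)

# Hardware approach: remapping page 19 to color 5 removes the conflict.
assert cache_set(19 * PAGE_SIZE, remap_color=5) != cache_set(3 * PAGE_SIZE)

# Software approach: the allocator honors a compile-time color hint of 5.
print(os_allocate([16, 19, 21], preferred_color=5))
```

In this sketch the software path changes which physical page backs a virtual page, whereas the hardware path leaves the physical page alone and changes only the cache index. A real design would presumably also have to keep tag comparison correct for remapped pages, which is omitted here.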
