Accurate and complexity-effective spatial pattern prediction

Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative Ll data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.

[1]  Sanjeev Kumar,et al.  Exploiting spatial locality in data caches using spatial footprints , 1998, ISCA.

[2]  Se-Hyun Yang,et al.  Near-optimal precharging in high-performance nanoscale CMOS caches , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[3]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[4]  James R. Goodman,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[5]  Antonio Gonzalez,et al.  A data cache with multiple caching strategies tuned to different types of locality , 1995, International Conference on Supercomputing.

[6]  Wei-Fen Lin,et al.  Filtering superfluous prefetches using density vectors , 2001, Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001.

[7]  Babak Falsafi,et al.  Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[8]  K. Roy,et al.  DRG-cache: a data retention gated-ground cache for low power , 2002, Proceedings 2002 Design Automation Conference (IEEE Cat. No.02CH37324).

[9]  장훈,et al.  [서평]「Computer Organization and Design, The Hardware/Software Interface」 , 1997 .

[10]  T. Mudge,et al.  Drowsy caches: simple techniques for reducing leakage power , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[11]  Kaushik Roy,et al.  Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories , 2000, ISLPED '00.

[12]  Rajesh K. Gupta,et al.  Adapting cache line size to application behavior , 1999, ICS '99.

[13]  David A. Wood,et al.  A model for estimating trace-sample miss ratios , 1991, SIGMETRICS '91.

[14]  Wen-mei W. Hwu,et al.  Run-time spatial locality detection and optimization , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[15]  Dirk Grunwald,et al.  Pipeline gating: speculation control for energy reduction , 1998, ISCA.

[16]  Walid A. Najjar,et al.  An evaluation of bottom-up and top-down thread generation techniques , 1993, MICRO 1993.

[17]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[18]  Margaret Martonosi,et al.  Cache decay: exploiting generational behavior to reduce cache leakage power , 2001, ISCA 2001.

[19]  Cezary Dubnicki,et al.  Adjustable Block Size Coherent Caches , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[20]  Yvon Jégou,et al.  Using virtual lines to enhance locality exploitation , 1994, ICS '94.

[21]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[22]  Kaushik Roy,et al.  An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[23]  Stefanos Kaxiras,et al.  Improving CC-NUMA performance using Instruction-based Prediction , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[24]  Robert H. Dennard,et al.  CMOS scaling for high performance and low power-the next ten years , 1995, Proc. IEEE.

[25]  Shekhar Y. Borkar,et al.  Design challenges of technology scaling , 1999, IEEE Micro.

[26]  Andreas Moshovos,et al.  Low-leakage asymmetric-cell SRAM , 2002, ISLPED '02.

[27]  Jean-Loup Baer,et al.  Pursuing the performance potential of dynamic cache line sizes , 1999, Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040).