The non-critical buffer: using load latency tolerance to improve data cache efficiency

Data cache performance is critical to overall processor performance as the latency gap betweem CPU core and main memory increases. Studies have shown that some loads have latency demands that allow them to be serviced from slower portions of memory, thus allowing more critical data to be kept in higher levels of the cache. We provide a strategy for identifying this latency-tolerant data at runtime and, using simple heuristics, keep it out of the main cache and place it instead in a small, parallel, associative buffer. Using such a non-critical buffer dramatically improves the hit rate for more critical data, and leads to a performance improvement comparable to or better than other traditional cache improvement schemes. IPC improvements of over 4% are seen for some benchmarks.

[1]  Edward S. Davidson,et al.  Reducing conflicts in direct-mapped caches with a temporality-based design , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[2]  Anant Agarwal,et al.  Column-associative caches: a technique for reducing the miss rate of direct-mapped caches , 1993, ISCA '93.

[3]  André Seznec,et al.  Acase For Two-way Skewed-associative Caches , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[4]  Dirk Grunwald,et al.  A comparison of software code reordering and victim buffers , 1999, CARN.

[5]  Dionisios N. Pnevmatikatos,et al.  Cache performance of the SPEC92 benchmark suite , 1993, IEEE Micro.

[6]  Lizy Kurian John,et al.  Design and performance evaluation of a cache assist to implement selective caching , 1997, Proceedings International Conference on Computer Design VLSI in Computers and Processors.

[7]  Srilatha Manne,et al.  Power and performance tradeoffs using various caching strategies , 1998, Proceedings. 1998 International Symposium on Low Power Electronics and Design (IEEE Cat. No.98TH8379).

[8]  Mark Horowitz,et al.  Cache performance of operating system and multiprogramming workloads , 1988, TOCS.

[9]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[10]  Wen-mei W. Hwu,et al.  Run-time Adaptive Cache Hierarchy Via Reference Analysis , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[11]  Doug Burger,et al.  Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .

[12]  Alvin R. Lebeck,et al.  Exploiting Load Latency Tolerance in Dynamically Scheduled Processors , 1998 .

[13]  A. Argawal,et al.  Cache performance of operating systems and multiprogramming , 1988 .