Compiler managed micro-cache bypassing for high performance EPIC processors

Advanced microprocessors have been increasing clock rates, well beyond the Gigahertz boundary. For such high performance microprocessors, a small and fast data micro cache (ucache) is important to overall performance, and proper management of it via load bypassing has a significant performance impact. In this paper, we propose and evaluate a hardware-software collaborative technique to manage ucache bypassing for EPIC processors. The hardware supports the ucache bypassing with a flag in the load instruction format, and the compiler employs static analysis and profiling to identify loads that should bypass the ucache. The collaborative method achieves a significant improvement in performance for the SpecInt2000 benchmarks. On average, about 40%, 30%, 24%, and 22% of load references are identified to bypass 256B, 1K, 4K, and 8K sized ucaches, respectively. This reduces the ucache miss rates by 39%, 32%, 28%, and 26%. The number of pipeline stalls from loads to their uses is reduced by 13%, 9%, 6%, and 5%. Meanwhile, the L1 and L2 cache misses remain largely unchanged. For the 256B ucache, bypassing improves overall performance on average by 5%.

[1]  Scott Mahlke,et al.  Sentinel scheduling: a model for compiler-controlled speculative execution , 1993 .

[2]  Chris Wilkerson,et al.  Locality vs. criticality , 2001, ISCA 2001.

[3]  Kishore N. Menezes,et al.  Wavefront scheduling: path based data representation and scheduling of subgraphs , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[4]  Shai Rubin,et al.  Focusing processor policies via critical-path prediction , 2001, ISCA 2001.

[5]  John Paul Shen,et al.  Non-vital loads , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[6]  Keshav Pingali,et al.  Dependence-based program analysis , 1993, PLDI '93.

[7]  Harsh Sharangpani,et al.  Itanium Processor Microarchitecture , 2000, IEEE Micro.

[8]  Gary S. Tyson,et al.  A modified approach to data cache management , 1995, MICRO 1995.

[9]  P. E. Livadas,et al.  Program dependence analysis , 1992, Proceedings Conference on Software Maintenance 1992.

[10]  Richard Uhlig,et al.  SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture , 1999 .

[11]  B. Ramakrishna Rau,et al.  EPIC: Explicititly Parallel Instruction Computing , 2000, Computer.

[12]  Guang R. Gao,et al.  Speculative Prefetching of Induction Pointers , 2001, CC.

[13]  Youfeng Wu,et al.  Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching , 2002, PLDI '02.

[14]  Wen-mei W. Hwu,et al.  Run-Time Cache Bypassing , 1999, IEEE Trans. Computers.

[15]  Gary S. Tyson,et al.  Active Management of Data Caches by Exploiting Reuse Information , 1999, IEEE Trans. Computers.

[16]  David A. Wood,et al.  Cache profiling and the SPEC benchmarks: a case study , 1994, Computer.

[17]  Gary S. Tyson,et al.  Region-based caching: an energy-delay efficient memory architecture for embedded processors , 2000, CASES '00.

[18]  Nasser A. Kurd,et al.  A multigigahertz clocking scheme for the Pentium(R) 4 microprocessor , 2001, IEEE J. Solid State Circuits.

[19]  Henry G. Dietz,et al.  Improving cache performance by selective cache bypass , 1989, [1989] Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track.

[20]  R. Iris Bahar,et al.  The non-critical buffer: using load latency tolerance to improve data cache efficiency , 1999, Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040).

[21]  Rastislav Bodík,et al.  Slack: maximizing performance under technological constraints , 2002, ISCA.

[22]  Brad Calder,et al.  Dynamic prediction of critical path instructions , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[23]  Scott A. Mahlke,et al.  Dynamic memory disambiguation using the memory conflict buffer , 1994, ASPLOS VI.