Cache miss heuristics and preloading techniques for general-purpose programs

Previous research on hiding memory latency has tended to focus on regular numerical programs. This paper presents a latency-hiding compiler technique that is applicable to general-purpose C programs. Assuming a lockup-free cache and instruction scoreboarding, our technique 'preloads' the data that are likely to cause a cache miss before they are used, thereby hiding the cache miss latency. We have developed simple compiler heuristics to identify load instructions that are likely to cause a cache miss. Experiments with a set of SPEC92 benchmarks show that our heuristics identify 85% of cache misses. We have also developed an algorithm that flexibly schedules the selected load instructions and the instructions that use the loaded data to hide memory latency. Our simulation suggests that the technique is successful in hiding memory latency and improves overall performance.
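The idea of preloading can be illustrated at the source level with a minimal sketch. This is not the paper's algorithm, which works on load instructions inside the compiler; it only shows the effect of separating a load from its first use so that, on a machine with a lockup-free cache and scoreboarding, a miss can overlap with other work. The `node` type and both function names are hypothetical, and an optimizing compiler is free to reorder these loads on its own.

```c
#include <stddef.h>

/* Hypothetical list node, introduced only for this illustration. */
struct node {
    int value;
    struct node *next;
};

/* Baseline: each loaded value is used immediately, so a cache miss
 * on p->value stalls the dependent add right away. */
long sum_baseline(const struct node *p) {
    long total = 0;
    while (p != NULL) {
        total += p->value;
        p = p->next;
    }
    return total;
}

/* Preloaded: the value of the next node is loaded one iteration before
 * it is used, leaving independent work between the load and its use. */
long sum_preloaded(const struct node *p) {
    long total = 0;
    if (p == NULL) {
        return 0;
    }
    int v = p->value;                 /* preload for the first node */
    while (p != NULL) {
        const struct node *n = p->next;
        int v_next = 0;
        if (n != NULL) {
            v_next = n->value;        /* issued early, used next iteration */
        }
        total += v;                   /* use of the value loaded earlier */
        v = v_next;
        p = n;
    }
    return total;
}
```

Both functions compute the same sum; they differ only in how far each load sits from the instruction that consumes it.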
