Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

As the degree of instruction-level parallelism in superscalar architectures increases, the gap between processor and memory performance continues to grow requiring more aggressive techniques to increase the performance of the memory system. We propose a new technique, which is based on the wrong-path execution of loads far beyond instruction fetch-limiting conditional branches, to exploit more instruction-level parallelism by reducing the impact of memory delays. We examine the effects of the execution of loads down the wrong branch path on the performance of an aggressive issue processor. We find that, by continuing to execute the loads issued in the mispredicted path, even after the branch is resolved, we can actually reduce the cache misses observed on the correctly executed path. This wrong-path execution of loads can result in a speedup of up to 5% due to an indirect prefetching effect that brings data or instruction blocks into the cache for instructions subsequently issued on the correctly predicted path. However, it also can increase the amount of memory traffic and can pollute the cache. We propose the Wrong Path Cache (WPC) to eliminate the cache pollution caused by the execution of loads down mispredicted branch paths. For the configurations tested, fetching the results of wrong path loads into a fully associative 8-entry WPC can result in a 12% to 39% reduction in L1 data cache misses and in a speedup of up to 37%, with an average speedup of 9%, over the baseline processor.

[1]  John Flynn,et al.  Adapting the SPEC 2000 benchmark suite for simulation-based computer architecture research , 2001 .

[2]  Per Stenström,et al.  Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[3]  Andreas Moshovos,et al.  Dependence based prefetching for linked data structures , 1998, ASPLOS VIII.

[4]  Doug Burger,et al.  Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .

[5]  Glenn Reinman,et al.  Fetch directed instruction prefetching , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[6]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[7]  Norman P. Jouppi,et al.  How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors? , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[8]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[9]  Janak H. Patel,et al.  Data prefetching in multiprocessor vector cache memories , 1991, ISCA '91.

[10]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[11]  Jean-Loup Baer,et al.  A performance study of software and hardware data prefetching schemes , 1994, ISCA '94.

[12]  Trevor Mudge,et al.  Improving data cache performance by pre-executing instructions under a cache miss , 1997 .

[13]  Ravi Pendse,et al.  Selective prefetching: prefetching when only required , 1999, 42nd Midwest Symposium on Circuits and Systems (Cat. No.99CH36356).

[14]  Trevor N. Mudge,et al.  Wrong-path instruction prefetching , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[15]  Dirk Grunwald,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[16]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[17]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[18]  John Paul Shen,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[19]  Brad Calder,et al.  Instruction recycling on a multiple-path processor , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[20]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[21]  Yale N. Patt,et al.  A comparison of dynamic branch predictors that use two levels of branch history , 1993, ISCA '93.

[22]  Michel Dubois,et al.  Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors , 1993, 1993 International Conference on Parallel Processing - ICPP'93.

[23]  Alexander V. Veidenbaum,et al.  Compiler-directed data prefetching in multiprocessors with memory hierarchies , 1990, ICS '90.

[24]  Jean-Loup Baer,et al.  Instruction cache fetch policies for speculative execution , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[25]  Trevor N. Mudge,et al.  The effect of speculative execution on cache performance , 1994, Proceedings of 8th International Parallel Processing Symposium.

[26]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[27]  Glenn Reinman,et al.  Predictive techniques for aggressive load speculation , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[28]  Joseph T. Rahmeh,et al.  Improving the accuracy of dynamic branch prediction using branch correlation , 1992, ASPLOS V.

[29]  David A. Patterson,et al.  Computer Architecture - A Quantitative Approach, 5th Edition , 1996 .

[30]  B. Calder,et al.  A scalable front-end architecture for fast instruction delivery , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).

[31]  David J. Lilja,et al.  Data prefetch mechanisms , 2000, CSUR.

[32]  Mikko H. Lipasti,et al.  Partial resolution in branch target buffers , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[33]  Mikko H. Lipasti,et al.  Cache miss heuristics and preloading techniques for general-purpose programs , 1995, MICRO 28.

[34]  David Bernstein,et al.  Compiler techniques for data prefetching on the PowerPC , 1995, PACT.