DStride: data-cache miss-address-based stride prefetching scheme for multimedia processors

Prefetching reduces cache miss latency by moving data up the memory hierarchy before it is needed. Recent hardware-based stride prefetching techniques mostly rely on processor pipeline information (e.g., the program counter and the branch prediction table) for prediction. Continuing developments in processor microarchitecture drastically change core pipeline design, so existing hardware-based stride prefetching techniques must be adapted to each new processor architecture. In this paper we present a new hardware-based stride prefetching technique, called DStride, that is independent of processor pipeline design changes. DStride uses the first-level data cache miss address stream for stride prediction. The miss addresses are separated into a load stream and a store stream to increase the efficiency of the predictor, and each stream is checked separately against the recent miss address stream to detect strides. Detected steady strides are maintained in a table that also performs look-ahead stride prefetching when the processor's stride reference rate is higher than the prefetch request service rate. We evaluated our design with multimedia workloads using execution-driven simulation with the SimpleScalar toolset. Our experiments show that DStride is very effective in reducing overall pipeline stalls due to cache miss latency, especially for stride-intensive applications such as multimedia workloads.
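The detection idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the DStride hardware: the abstract does not give the table organization, the history depth, or the exact steadiness test, so the class names (`MissStrideDetector`, `DStrideSketch`), the `history` and `lookahead` parameters, and the rule "three equally spaced misses make a steady stride" are all hypothetical. A fixed `lookahead` count stands in for the rate-dependent look-ahead mechanism.

```python
from collections import deque


class MissStrideDetector:
    """Toy stride detector for one miss-address stream (hypothetical design)."""

    def __init__(self, history=8, lookahead=1):
        self.recent = deque(maxlen=history)  # most recent miss addresses
        self.lookahead = lookahead           # assumed: strides to run ahead

    def on_miss(self, addr):
        """Record a miss and return the addresses to prefetch (may be empty)."""
        prefetches = []
        seen = set(self.recent)
        # Scan from the newest miss backwards for an earlier miss `prev`
        # such that `prev - stride` is also in the recent history.
        for prev in reversed(self.recent):
            stride = addr - prev
            if stride != 0 and (prev - stride) in seen:
                # Assumed steadiness rule: three equally spaced misses.
                prefetches = [addr + stride * k
                              for k in range(1, self.lookahead + 1)]
                break
        self.recent.append(addr)
        return prefetches


class DStrideSketch:
    """Keeps separate detectors for the load and store miss streams."""

    def __init__(self, history=8, lookahead=1):
        self.streams = {
            "load": MissStrideDetector(history, lookahead),
            "store": MissStrideDetector(history, lookahead),
        }

    def on_miss(self, kind, addr):
        # kind is "load" or "store"; each stream has its own history.
        return self.streams[kind].on_miss(addr)
```

Because the detector sees only miss addresses, it needs no program counter or other pipeline state. Keeping separate histories for loads and stores means that store misses interleaved with a strided load sequence do not displace the load history.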
