An Efficient Hardware Prefetcher Exploiting the Prefetch Potential of Long-Stride Access Pattern on Virtual Address

Large-scale computing and applications with large data sets often suffer high cache miss rates because they use arrays of structures or linked-list data structures. When these data structures are traversed, the memory accesses may have constant long strides that cross page boundaries in the virtual address (VA) space, while the corresponding physical addresses (PA) are mostly scattered. A conventional stride prefetcher (SP) based on PA therefore cannot prefetch data efficiently in this case. In this paper, we propose a hardware data prefetching design named Virtual Address-based Stride Prefetcher (VASP) to exploit the prefetch potential of long-stride access patterns on VA. VASP detects access strides on VA, including those that cross pages, then predicts a new VA and prefetches the data after address translation. We implement VASP in the gem5 simulator and evaluate its performance with the SPEC CPU2006 integer benchmarks. Our simulation results show that, compared with SP, applying VASP to caches offers up to 43% performance improvement on the mcf benchmark and improves overall performance by 6%.
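The detect, predict, and translate steps described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the VASP hardware design: the per-PC table layout, the confidence threshold of 2, the 4 KiB page size, and the names `VAStridePrefetcher`, `access`, and `translate` are all hypothetical choices made for illustration, and the page table is modeled as a plain dictionary.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


class VAStridePrefetcher:
    """Toy model of stride detection on virtual addresses (hypothetical)."""

    def __init__(self, page_table, threshold=2):
        # page_table maps virtual page number -> physical frame number.
        self.page_table = page_table
        # Number of repeated strides required before prefetching (assumed).
        self.threshold = threshold
        # Per-PC entries: pc -> (last_va, stride, confidence).
        self.table = {}

    def translate(self, va):
        """Return the physical address for va, or None if the page is unmapped."""
        frame = self.page_table.get(va // PAGE_SIZE)
        if frame is None:
            return None
        return frame * PAGE_SIZE + va % PAGE_SIZE

    def access(self, pc, va):
        """Record one demand access; return a physical prefetch address or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (va, 0, 0)
            return None
        last_va, stride, confidence = entry
        new_stride = va - last_va
        if new_stride != 0 and new_stride == stride:
            confidence += 1          # same stride seen again
        else:
            stride, confidence = new_stride, 0  # retrain on the new stride
        self.table[pc] = (va, stride, confidence)
        if confidence >= self.threshold:
            # Predict the next VA first, then translate it. The stride may
            # cross a page boundary, which is harmless on the VA side.
            return self.translate(va + stride)
        return None


if __name__ == "__main__":
    # Consecutive virtual pages map to scattered physical frames.
    page_table = {0: 7, 1: 3, 2: 9, 3: 1, 4: 5}
    prefetcher = VAStridePrefetcher(page_table)
    stride = PAGE_SIZE + 64  # a long stride that crosses a page every access
    for i in range(4):
        va = i * stride
        print(hex(va), prefetcher.translate(va), prefetcher.access(0x400, va))
```

In this toy run the VA stride stays constant, while the translated physical addresses of consecutive accesses differ by varying amounts because the frames are scattered. A detector that observed only physical addresses would therefore see no stable stride, which is the situation the abstract describes.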
