Resource conscious prefetching for irregular applications in multicores

Many real-world applications exhibit irregular memory access patterns that cannot be handled by the stream prefetchers in commodity processors. While it is possible to target irregular accesses by prefetching them in software, doing so requires a low-overhead method that prefetches only useful data and remains friendly to the last-level cache and off-chip bandwidth. Further, to be practical, such approaches should ideally not require access to source code. In this work, we present a low-overhead, software-only method for efficient prefetching of irregular memory access patterns. The method targets commodity multicores and is designed to conserve shared last-level cache space and off-chip bandwidth. Our approach uses low-overhead runtime sampling and statistical cache modeling to identify irregular loads that frequently miss in the cache. A cost-benefit analysis then identifies the irregular loads that can benefit from software prefetching. This approach improves average single-thread performance across 10 workloads by 10%, without dramatically increasing off-chip bandwidth usage. We evaluate our method on two commodity multicores. Across 210 multi-process runs that utilize a multicore by running several different workloads in parallel, the proposed irregular software prefetching mechanism achieves up to 22% better throughput than hardware prefetching. All workload mixes benefit from our scheme, with throughput improving by 9% on average.
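As a rough illustration of the cost-benefit step described above, the following minimal sketch selects loads whose estimated saved miss latency exceeds the estimated cost of executing the extra prefetch instructions. It is not the model used in this work: the `LoadProfile` fields, the function `select_prefetch_candidates`, and the default values for `miss_latency`, `prefetch_cost`, and `min_miss_ratio` are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class LoadProfile:
    """Per-load statistics, assumed to come from runtime sampling (hypothetical layout)."""
    pc: int            # address of the load instruction
    executions: int    # estimated number of times the load executed
    miss_ratio: float  # modeled fraction of executions that miss in the cache
    irregular: bool    # True if no constant stride was detected for this load


def select_prefetch_candidates(profiles, miss_latency=200.0,
                               prefetch_cost=2.0, min_miss_ratio=0.05):
    """Return the PCs of loads worth prefetching, highest net benefit first.

    All cost parameters are illustrative assumptions, expressed in cycles:
    miss_latency is the stall avoided per covered miss, prefetch_cost is the
    overhead paid on every execution of the load.
    """
    selected = []
    for p in profiles:
        # Regular (strided) loads are left to the hardware prefetcher,
        # and loads that rarely miss are not worth instrumenting.
        if not p.irregular or p.miss_ratio < min_miss_ratio:
            continue
        benefit = p.executions * p.miss_ratio * miss_latency
        cost = p.executions * prefetch_cost
        if benefit > cost:
            selected.append((benefit - cost, p.pc))
    selected.sort(reverse=True)
    return [pc for _, pc in selected]


profiles = [
    LoadProfile(pc=0x400100, executions=1000, miss_ratio=0.5, irregular=True),
    LoadProfile(pc=0x400200, executions=1000, miss_ratio=0.001, irregular=True),
    LoadProfile(pc=0x400300, executions=1000, miss_ratio=0.9, irregular=False),
    LoadProfile(pc=0x400400, executions=1000, miss_ratio=0.1, irregular=True),
]
candidates = select_prefetch_candidates(profiles)
```

In this toy input, only the two irregular loads with a non-trivial miss ratio are selected. A real implementation would also have to account for prefetch timeliness and for the bandwidth and cache space consumed by each prefetch, which this sketch ignores.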
