Resource-Aware Compiler Prefetching for Many-Cores

Super-scalar, out-of-order processors that can have tens of read and write requests in flight in the execution window place significant demands on Memory-Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g., GPUs), where per-core power and area budgets favor lighter cores with fewer resources. Support for hardware and software prefetching increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of, and optimizes for, the number of simultaneous prefetches the hardware supports. We show that when not enough resources are available to issue prefetch instructions for all references in a loop, it is more beneficial to decrease the prefetch distance and prefetch as many references as possible, rather than use a fixed prefetch distance and skip prefetching for some references, as current approaches do. We implemented our algorithm in a GCC-derived compiler and evaluated its performance on an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% and the state-of-the-art GCC implementation by up to 34.79%. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show improvements of up to 24.61%.
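
To make the trade-off concrete, the sketch below contrasts the two policies on a simple triad loop. It is a minimal illustration under stated assumptions, not the paper's implementation: the per-core limit MISS_SLOTS, the baseline distance FIXED_DIST, the distance-scaling rule, and the use of GCC's __builtin_prefetch intrinsic are all choices made for this example.

```c
/*
 * Minimal sketch of the prefetch-resource trade-off described above,
 * assuming a per-core budget of MISS_SLOTS outstanding prefetches and
 * three reference streams (a[], b[], c[]).  The constants and the
 * distance-scaling rule are illustrative, not taken from the paper.
 */
#include <stddef.h>

#define MISS_SLOTS 4    /* assumed hardware limit on in-flight prefetches   */
#define FIXED_DIST 16   /* iterations ahead a fixed-distance scheme targets */

/* Fixed-distance policy: only one of the three streams fits the budget at
 * distance FIXED_DIST, so prefetches for the other references are skipped. */
void triad_fixed(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + FIXED_DIST < n)
            __builtin_prefetch(&b[i + FIXED_DIST], 0, 1); /* read stream */
        a[i] = b[i] + 2.0f * c[i];   /* c[] and the store to a[] get no prefetch */
    }
}

/* Resource-aware policy (in the spirit of RAP): shrink the distance so the
 * same MISS_SLOTS budget covers every reference in the loop body. */
void triad_rap(float *a, const float *b, const float *c, size_t n)
{
    const size_t dist = (MISS_SLOTS / 3 > 0) ? MISS_SLOTS / 3 : 1; /* ~1 slot per stream */
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            __builtin_prefetch(&b[i + dist], 0, 1); /* read stream  */
            __builtin_prefetch(&c[i + dist], 0, 1); /* read stream  */
            __builtin_prefetch(&a[i + dist], 1, 1); /* write stream */
        }
        a[i] = b[i] + 2.0f * c[i];
    }
}
```

With the shorter distance each prefetch arrives less far ahead of its use, but, per the result above, covering all references within the miss-handling budget pays off more than prefetching only a subset at the ideal distance.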
