Adaptive prefetching for shared cache based chip multiprocessors

Chip multiprocessors (CMPs) present a unique scenario for software data prefetching, with subtle tradeoffs between memory bandwidth and performance. In a shared-L2-based CMP, multiple cores compete for the shared on-chip cache space and the limited off-chip pin bandwidth. Purely software-based prefetching techniques tend to increase this contention, which degrades performance. In some cases, a prefetch becomes harmful: it evicts from the shared cache useful data whose next use comes earlier than that of the prefetched data. The fraction of such harmful prefetches usually grows as more cores are used to execute a multi-threaded application. In this paper, we propose two complementary techniques to address the problem of harmful prefetches in shared-L2-based CMPs: suppressing select data prefetches (if they are found to be harmful) and pinning select data in the L2 cache (if they are found to be frequent victims of harmful prefetches). We evaluate both techniques using two embedded application codes. Our experiments demonstrate that the two techniques are very effective in mitigating the impact of harmful prefetches and, as a result, we extract significant benefits from software prefetching even with large core counts.
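
The harmful-prefetch criterion and the two responses described above can be illustrated with a minimal sketch. It is not the paper's implementation: the class name `AdaptivePolicy`, the per-site counters, and both thresholds are hypothetical choices made for illustration, and next-use times are assumed to be known (for example, from a profiling trace).

```python
from collections import Counter


def is_harmful(victim_next_use, prefetched_next_use):
    """Return True if the evicted block is needed sooner than the prefetched one.

    Times are abstract timestamps; None means "never used again".
    """
    if victim_next_use is None:
        return False  # the victim was dead data, so evicting it costs nothing
    if prefetched_next_use is None:
        return True   # a useless prefetch displaced live data
    return victim_next_use < prefetched_next_use


class AdaptivePolicy:
    """Hypothetical bookkeeping for prefetch suppression and L2 pinning."""

    def __init__(self, suppress_threshold=0.5, pin_threshold=2):
        # Both thresholds are illustrative assumptions, not values from the paper.
        self.suppress_threshold = suppress_threshold
        self.pin_threshold = pin_threshold
        self.issued = Counter()        # prefetches issued per prefetch site
        self.harmful = Counter()       # harmful prefetches per prefetch site
        self.victim_count = Counter()  # times each block was a harmful victim

    def record(self, site, victim_block, victim_next_use, prefetched_next_use):
        """Log one prefetch and the block it evicted (None if no eviction)."""
        self.issued[site] += 1
        if victim_block is not None and is_harmful(victim_next_use, prefetched_next_use):
            self.harmful[site] += 1
            self.victim_count[victim_block] += 1

    def should_suppress(self, site):
        """Suppress a site whose harmful fraction exceeds the threshold."""
        issued = self.issued[site]
        return issued > 0 and self.harmful[site] / issued > self.suppress_threshold

    def should_pin(self, block):
        """Pin a block that has repeatedly been the victim of harmful prefetches."""
        return self.victim_count[block] >= self.pin_threshold


policy = AdaptivePolicy()
policy.record("site1", "A", victim_next_use=10, prefetched_next_use=50)
policy.record("site1", "A", victim_next_use=12, prefetched_next_use=60)
policy.record("site1", "B", victim_next_use=100, prefetched_next_use=20)
policy.record("site2", "C", victim_next_use=100, prefetched_next_use=20)
print(policy.should_suppress("site1"), policy.should_pin("A"))
```

In this toy trace, two of the three prefetches from `site1` evict block `A` before it is reused, so the sketch flags `site1` for suppression and `A` for pinning, while `site2` and block `B` are left alone.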
