Characterization and dynamic mitigation of intra-application cache interference

Given the emerging dominance of CMPs, an important research problem concerns application memory performance in the face of deep memory hierarchies, where one or more caches are shared by several cores. In current systems, many factors can cause interference in the shared last-level cache (LLC). While predicting an application's memory performance is difficult enough in an idealized setup, it becomes even more complicated in real-machine environments in which interference can stem from operating system memory accesses, and even from an application's own prefetch requests and page table walks caused by TLB misses. This paper characterizes the degree by which intra-application interference factors such as page table walks and hardware prefetching influence performance. Using hardware performance counters on an Intel platform, we first characterize real-system LLC interference and show that application data memory references represent much less than half of the LLC misses, with hardware prefetching and page table walks causing considerable LLC interference. Based on these characterizations, we propose dynamic management methods to reduce intra-application interference. First, we evaluate a dynamic OS-reference-aware cache insertion policy that reduces interference and improves user IPCs by as much as 19% (5% on average). Second, to mitigate prefetch-induced LLC interference, we propose, implement, and evaluate an automatic prefetch manager that uses Intel PEBS capabilities to dynamically estimate prefetch-induced interference and accordingly adjust the aggressiveness of hardware prefetchers as programs run. Overall, our characterizations are important in highlighting the challenges of intra-application interference, and our hardware and software proposals offer significant solutions for addressing them.

[1]  Trevor N. Mudge,et al.  A look at several memory management units, TLB-refill mechanisms, and page table organizations , 1998, ASPLOS VIII.

[2]  Xiaotong Zhuang,et al.  Reducing Cache Pollution via Dynamic Data Prefetch Filtering , 2007, IEEE Transactions on Computers.

[3]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[4]  Mark Horowitz,et al.  Cache performance of operating system and multiprogramming workloads , 1988, TOCS.

[5]  Santosh G. Abraham,et al.  Effective stream-based and execution-based data prefetching , 2004, ICS '04.

[6]  Calvin Lin,et al.  Memory Prefetching Using Adaptive Stream Detection , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[7]  Alan L. Cox,et al.  Translation caching: skip, don't walk (the page table) , 2010, ISCA.

[8]  Aamer Jaleel,et al.  Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[9]  James E. Smith,et al.  Virtual private caches , 2007, ISCA '07.

[10]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[11]  Gabriel H. Loh,et al.  PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[12]  Mahmut T. Kandemir,et al.  Adaptive set pinning: managing shared caches in chip multiprocessors , 2008, ASPLOS.

[13]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[14]  K.J. Nesbit,et al.  AC/DC: an adaptive data cache prefetcher , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[15]  Li Zhao,et al.  CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[16]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[17]  Onur Mutlu,et al.  Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[18]  Josep Torrellas,et al.  Characterizing the caching and synchronization performance of a multiprocessor operating system , 1992, ASPLOS V.

[19]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[20]  Srilatha Manne,et al.  Accelerating two-dimensional page walks for virtualized systems , 2008, ASPLOS.

[21]  S. Kim,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..