论文信息 - Combining Process-Based Cache Partitioning and Pollute Region Isolation to Improve Shared Last Level Cache Utilization on Multicore Systems

Combining Process-Based Cache Partitioning and Pollute Region Isolation to Improve Shared Last Level Cache Utilization on Multicore Systems

Shared last level cache has been widely used in modern multicore processors. However, uncontrolled cache sharing on multicore leads to more serious cache pollution than that on single-core processor. A process with weak locality can evict strong locality data sets that belong to other concurrent ones. Processes in multiprocessing environment always affect each other on multicore systems with shared last level cache. Prior approaches either partition shared cache in process level to reduce inter-process cache contention, or isolate the non-temporal memory accesses in order to accelerate single application execution. Process-based cache partitioning may make intra-process cache pollution more serious and have great impact on single process performance. In this work, we take an alternative view to explore physical page layout optimization by combining process-based cache partitioning and pollute region isolation for improving the shared last level cache utilization on multicore systems. Our proposed approach includes three steps. The first step determines the cache sizes of co-scheduled applications and the second step recognizes weak-locality regions of each application on different cache size configurations. Lastly, the third step customizes the physical page layout to partition cache space among concurrent processes and set up global pollute buffer for mapping pollute regions into a small slice of shared last level cache. Our approach is directly used in commercial multicore systems without any additional hardware requirement. Our experimental results show that in comparison with default Linux memory management scheme, our approach improves performance by 26.73% on average. Even compared to the process-based cache partitioning RapidMRC, our approach further eliminates the harmful effect of non-reusable data, and system performance is also improved by 5.63% on average.

[1] Yutao Zhong,et al. Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[2] Gary S. Tyson,et al. Region-based caching: an energy-delay efficient memory architecture for embedded processors , 2000, CASES '00.

[3] Zhao Zhang,et al. Enabling software management for multicore caches with a lightweight hardware support , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[4] Tao Huang,et al. Reducing last level cache pollution through OS-level software-controlled region-based partitioning , 2012, SAC '12.

[5] Emery D. Berger,et al. CRAMM: virtual memory support for garbage-collected applications , 2006, OSDI '06.

[6] Yan Solihin,et al. Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[7] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[8] Aamer Jaleel,et al. Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[9] S. Kim,et al. Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[10] Michael Stumm,et al. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations , 2009, ASPLOS.

[11] Yuanyuan Zhou,et al. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches , 2001, USENIX Annual Technical Conference, General Track.

[12] Michael Stumm,et al. Enhancing operating system support for multicore processors by using hardware performance monitoring , 2009, OPSR.

[13] Yan Solihin,et al. Counter-based cache replacement algorithms , 2005, 2005 International Conference on Computer Design.

[14] Sang Lyul Min,et al. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references , 2000, OSDI.

[15] Michael Stumm,et al. Path: page access tracking to improve memory management , 2007, ISMM '07.

[16] Aamer Jaleel,et al. Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17] John L. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium , 2000, Computer.

[18] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[19] Laszlo A. Belady,et al. A Study of Replacement Algorithms for Virtual-Storage Computer , 1966, IBM Syst. J..

[20] Irving L. Traiger,et al. Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[21] Zhang Min-xuan. A Parallel Stream Memory Architecture for Heterogeneous Multi-core Processor , 2009 .

[22] Xiaoning Ding,et al. ULCC: a user-level facility for optimizing shared cache performance on multicores , 2011, PPoPP '11.

[23] David K. Tam,et al. Managing Shared L2 Caches on Multicore Systems in Software , 2007 .

[24] Chen Ding,et al. On the theory and potential of LRU-MRU collaborative cache management , 2011, ISMM '11.

[25] Michael Stumm,et al. Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[26] Zhao Zhang,et al. Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[27] Richard E. Kessler,et al. Page placement algorithms for large real-indexed caches , 1992, TOCS.