Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning

Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large class of programs, the locality information can be accurately predicted if access patterns are recognized through small training runs at the data object level. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented with a modified Linux kernel, and tested on a commodity multi-core processor. Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.

[1]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[2]  Zhiyuan Li,et al.  A sample-based cache mapping scheme , 2005, LCTES '05.

[3]  Wen-mei W. Hwu,et al.  Run-Time Cache Bypassing , 1999, IEEE Trans. Computers.

[4]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[5]  Jichuan Chang,et al.  Cooperative cache partitioning for chip multiprocessors , 2007, ICS '07.

[6]  Yannis Smaragdakis,et al.  Adaptive Caches: Effective Shaping of Cache Behavior to Workloads , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[7]  Michael Stumm,et al.  Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[8]  Bobby Bodenheimer,et al.  Synthesis and evaluation of linear motion transitions , 2008, TOGS.

[9]  Srihari Makineni,et al.  Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[10]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[11]  Mateo Valero,et al.  A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality , 1995, International Conference on Supercomputing.

[12]  Arnold L. Rosenberg,et al.  Using the compiler to improve cache replacement decisions , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[13]  Pradeep Dubey,et al.  Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.

[14]  Xiaoning Ding,et al.  MCC-DB: Minimizing Cache Conflicts in Multi-core Processors for Databases , 2009, Proc. VLDB Endow..

[15]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[16]  James R. Goodman,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[17]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[18]  Scott A. Mahlke,et al.  Compiler-managed partitioned data caches for low power , 2007, LCTES '07.

[19]  Todd C. Mowry,et al.  Compiler-directed page coloring for multiprocessors , 1996, ASPLOS VII.

[20]  Won-Taek Lim,et al.  Architectural support for operating system-driven CMP cache management , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[21]  Chandra Krintz,et al.  Cache-conscious data placement , 1998, ASPLOS VIII.

[22]  Chen Ding,et al.  Miss Rate Prediction Across Program Inputs and Cache Configurations , 2007, IEEE Transactions on Computers.

[23]  Zhao Zhang,et al.  Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[24]  S. Abraham,et al.  Eecient Simulation of Multiple Cache Conngurations Using Binomial Trees , 1991 .

[25]  Peter F. Sweeney,et al.  Multiple page size modeling and optimization , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[26]  Peter Davies,et al.  The TLB slice-a low-cost high-speed address translation mechanism , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[27]  Qiang Wu,et al.  Exposing memory access regularities using object-relative memory profiling , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..