High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches

Increasing transistor density enables adding more on-die cache real-estate However, devoting more space to the shared last-level-cache (LLC) causes the memory latency bottleneck to move from memory access latency to shared cache access latency. As such, applications whose working set is larger than the smaller caches spend a large fraction of their execution time on shared cache access latency. To address this problem, this paper investigates increasing the size of smaller private caches in the hierarchy as opposed to increasing the shared LLC. Doing so improves average cache access latency for workloads whose working set fits into the larger private cache while retaining the benefits of a shared LLC. The consequence of increasing the size of private caches is to relax inclusion and build exclusive hierarchies. Thus, for the same total caching capacity, an exclusive cache hierarchy provides better cache access latency. We observe that server workloads benefit tremendously from an exclusive hierarchy with large private caches. This is primarily because large private caches accommodate the large code working-sets of server workloads. For a 16-core CMP, an exclusive cache hierarchy improves server workload performance by 5-12% as compared to an equal capacity inclusive cache hierarchy. The paper also presents directions for further research to maximize performance of exclusive cache hierarchies.

[1]  Babak Falsafi,et al.  Proactive instruction fetch , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Wen-Hann Wang,et al.  On the inclusion properties for multi-level cache hierarchies , 1988, ISCA '88.

[3]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[4]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[5]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[6]  Mainak Chaudhuri,et al.  Bypass and insertion algorithms for exclusive last-level caches , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[7]  Moinuddin K. Qureshi Adaptive Spill-Receive for robust high-performance caching in CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[8]  Thomas F. Wenisch,et al.  Practical off-chip meta-data for temporal memory streaming , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[9]  Mohamed M. Zahran,et al.  Non-Inclusion Property in Multi-Level Caches Revisited , 2007, Int. J. Comput. Their Appl..

[10]  Norman P. Jouppi,et al.  Tradeoffs in two-level on-chip caching , 1994, ISCA '94.

[11]  Aamer Jaleel,et al.  Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[12]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[13]  Pejman Lotfi Kamran Scale-Out Processors , 2013 .

[14]  Babak Falsafi,et al.  Database Servers on Chip Multiprocessors: Limitations and Opportunities , 2007, CIDR.

[15]  Ying Zheng,et al.  Performance evaluation of exclusive cache hierarchies , 2004, IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004.

[16]  Babak Falsafi,et al.  Scale-out processors , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[17]  Luiz André Barroso,et al.  Memory system characterization of commercial workloads , 1998, ISCA.

[18]  Aamer Jaleel,et al.  Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[19]  Thomas F. Wenisch,et al.  Temporal instruction fetch streaming , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[20]  B. Jacob,et al.  CMP $ im : A Pin-Based OnThe-Fly Multi-Core Cache Simulator , 2008 .

[21]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[22]  A. Jaleel Memory Characterization of Workloads Using Instrumentation-Driven Simulation A Pin-based Memory Characterization of the SPEC CPU 2000 and SPEC CPU 2006 Benchmark Suites , 2022 .

[23]  Carole-Jean Wu,et al.  PACMan: Prefetch-Aware Cache Management for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[24]  Scott McFarling Cache replacement with dynamic exclusion , 1992, ISCA '92.

[25]  Hyesoon Kim,et al.  FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[26]  M. Zahran Cache Replacement Policy Revisited , 2022 .

[27]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[28]  Glenn Reinman,et al.  Optimizations Enabled by a Decoupled Front-End Architecture , 2001, IEEE Trans. Computers.

[29]  Thomas F. Wenisch,et al.  Temporal streams in commercial server applications , 2008, 2008 IEEE International Symposium on Workload Characterization.

[30]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[31]  David A. Wood,et al.  ASR: Adaptive Selective Replication for CMP Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[32]  Carole-Jean Wu,et al.  SHiP: Signature-based Hit Predictor for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[33]  Kevin M. Lepak,et al.  Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor , 2010, IEEE Micro.

[34]  Katherine E. Fletcher,et al.  Techniques For Reducing the Impact of Inclusion in Shared Network Cache Multiprocessors , 1994 .