CQoS: a framework for enabling QoS in shared caches of CMP platforms

Cache hierarchies have traditionally been designed for use by a single application, thread, or core. As multi-threaded (MT) and multi-core (CMP) platform architectures emerge, with workloads ranging from single-threaded and multi-threaded applications to complex virtual machines (VMs), a shared cache resource is consumed by these different entities, whose heterogeneous memory access streams exhibit different locality properties and varying memory sensitivity. As a result, conventional cache management approaches that treat all memory accesses equally are bound to yield inefficient space utilization and poor performance, even for applications with good locality. To address this problem, this paper presents a new cache management framework (CQoS) that (1) recognizes the heterogeneity in memory access streams, (2) introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity, and (3) assigns and enforces priorities on streams based on latency sensitivity, locality degree, and application performance needs. To achieve this, we propose CQoS options for priority classification, priority assignment, and priority enforcement. We briefly describe the CQoS priority classification and assignment options, which range from user-driven and developer-driven to compiler-detected and flow-based approaches. Our focus in this paper is on CQoS mechanisms for priority enforcement: (1) selective cache allocation, (2) static/dynamic set partitioning, and (3) heterogeneous cache regions. We discuss the architectural design and implementation complexity of these options. To evaluate their performance trade-offs, we model them in a cache simulator and evaluate their performance on CMP platforms running network-intensive server workloads. Our simulation results show the effectiveness of the proposed options and make the case for CQoS in future multi-threaded/multi-core platforms: it improves shared cache efficiency and thereby increases overall system performance.
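
As a concrete illustration of the priority-enforcement idea, the sketch below models quota-based selective allocation in a set-associative shared cache: each priority class is limited to a configurable number of ways per set, so a low-priority stream cannot displace lines belonging to a latency-sensitive, high-priority class once its quota is reached. This is a minimal Python sketch under assumed parameters (the class numbering, quotas, 64-byte lines, and LRU ordering are illustrative assumptions), not the paper's actual hardware mechanism.

```python
from collections import namedtuple

Line = namedtuple("Line", ["tag", "prio"])

class CQoSCache:
    """Set-associative cache with per-priority way quotas (illustrative only)."""

    def __init__(self, num_sets=1024, ways=8, way_quota=None):
        self.num_sets = num_sets
        self.ways = ways
        # way_quota[prio] = maximum ways per set that this priority class may
        # occupy; 0 is taken to be the highest priority (assumed values).
        self.way_quota = way_quota or {0: ways, 1: 2}
        self.sets = [[] for _ in range(num_sets)]   # each set kept in MRU->LRU order

    def access(self, addr, prio):
        """Return True on hit, False on miss; enforce quotas on allocation."""
        index = (addr // 64) % self.num_sets        # assuming 64-byte cache lines
        tag = addr // (64 * self.num_sets)
        cache_set = self.sets[index]
        for i, line in enumerate(cache_set):
            if line.tag == tag:                     # hit: move line to MRU position
                cache_set.insert(0, cache_set.pop(i))
                return True
        self._allocate(cache_set, tag, prio)        # miss: allocate subject to quota
        return False

    def _allocate(self, cache_set, tag, prio):
        occupancy = sum(1 for line in cache_set if line.prio == prio)
        if occupancy >= self.way_quota.get(prio, self.ways):
            # Quota exhausted: this class may only victimize its own (or an even
            # lower-priority) line, searching from the LRU end of the set.
            for i in range(len(cache_set) - 1, -1, -1):
                if cache_set[i].prio >= prio:       # larger number = lower priority
                    cache_set.pop(i)
                    break
            else:
                return                              # no eligible victim: bypass the cache
        elif len(cache_set) >= self.ways:
            cache_set.pop()                         # set full: normal LRU eviction
        cache_set.insert(0, Line(tag, prio))        # install new line at MRU
```

For example, `cache.access(addr, prio=1)` could model a reference from a low-priority streaming packet buffer that is allowed at most two ways per set and bypasses allocation beyond that, while `prio=0` references behave as in a conventional LRU cache.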
