Interference Aware Cache Designs for Operating System Execution

Large-scale chip multiprocessors will likely be heterogeneous. Several groups have suggested that it may be worthwhile to implement some cores that are specially tuned to execute common code patterns. One such workload, which will run on every future processor, is the operating system. Many future workloads will spend a large fraction of their execution time in privileged mode, executing either system calls or pure operating system functionality. Vast transistor budgets and relatively low on-chip communication latencies make it feasible to off-load the execution of privileged instruction sequences onto such a custom core. In this paper, we first examine this off-load approach and attempt to understand its benefits. We then try to architect a solution that captures the benefits of off-loading and eliminates its disadvantages. In essence, the benefits of off-loading can be attributed to reduced cache interference, while its disadvantages are the high latency costs of off-load and cache coherence. Our proposed solution employs a special OS cache per core and improves performance by up to 18% for OS-intensive workloads without any significant addition of transistors. We consider several design choices for this OS cache and argue that it is a better use of the transistor and power budget than the off-loading approach, whether the transistor budget is increased or left unchanged.
