Inter-core cooperative TLB for chip multiprocessors

Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for uniprocessors. With the growing dominance of chip multiprocessors (CMPs), it is necessary to examine TLB performance in the context of parallel workloads. This work is the first to present TLB prefetchers that exploit commonality in TLB miss patterns across cores in CMPs. We propose and evaluate two Inter-Core Cooperative (ICC) TLB prefetching mechanisms, assessing their effectiveness at eliminating TLB misses both individually and together. Our results show these approaches require at most modest hardware and can collectively eliminate 19% to 90% of data TLB (D-TLB) misses across the surveyed parallel workloads. We also compare performance improvements across a range of hardware and software implementation possibilities. We find that while a fully-hardware implementation results in average performance improvements of 8-46% for a range of TLB sizes, a hardware/software approach yields improvements of 4-32%. Overall, our work shows that TLB prefetchers exploiting inter-core correlations can effectively eliminate TLB misses.

[1]  Trevor Mudge,et al.  Design tradeoffs for software-managed TLBs , 1993, ISCA '93.

[2]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[3]  Jerry Huck,et al.  Architectural support for translation table management in large address space machines , 1993, ISCA '93.

[4]  Douglas W. Clark,et al.  Performance of the VAX-11/780 translation buffer: simulation and measurement , 1985, TOCS.

[5]  Brian N. Bershad,et al.  The interaction of architecture and operating system design , 1991, ASPLOS IV.

[6]  Anand Sivasubramaniam,et al.  Going the distance for TLB prefetching: an application-driven study , 2002, ISCA.

[7]  Gary S. Tyson,et al.  A prefetch taxonomy , 2004, IEEE Transactions on Computers.

[8]  Anoop Gupta,et al.  The impact of architectural trends on operating system performance , 1995, SOSP.

[9]  Michel Dubois,et al.  Options for dynamic address translation in COMAs , 1998, ISCA.

[10]  Mark D. Hill,et al.  Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.

[11]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[12]  Margaret Martonosi,et al.  Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[13]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[14]  Trevor N. Mudge,et al.  Software-managed address translation , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[15]  Michel Dubois,et al.  International Conference on Parallel Processing Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors , 2006 .

[16]  Norman P. Jouppi,et al.  A simulation based study of TLB performance , 1992, ISCA '92.

[17]  Mark D. Hill,et al.  Use of superpages and subblocking in the address translation hierarchy , 1995 .

[18]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[19]  Trevor N. Mudge,et al.  A look at several memory management units, TLB-refill mechanisms, and page table organizations , 1998, ASPLOS VIII.

[20]  Anand Sivasubramaniam,et al.  Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks , 2002, SIGMETRICS '02.

[21]  Per Stenström,et al.  Recency-based TLB preloading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[22]  Trevor N. Mudge,et al.  Virtual memory in contemporary microprocessors , 1998, IEEE Micro.