Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors

Translation Lookaside Buffers (TLBs) are a staple in modern computer systems and have a significant impact on overall system performance. Numerous prior studies have addressed TLB designs to lower access times and miss rates; these, however, have been targeted towards uniprocessor architectures. As the computer industry embraces chip multiprocessor (CMP) architectures, it is important to study the TLB behavior of emerging parallel workloads. This work presents the first full-system characterization of the TLB behavior of emerging parallel applications on real-system CMPs. Using the PARSEC benchmarks, representative of emerging RMS workloads, we show that TLB misses can hinder system performance significantly. We also evaluate TLB miss stream patterns and show that multiple threads of a parallel execution experience a large number of redundant and predictable misses. For our evaluated benchmarks, 30% to 95% of the total misses fall under this category. Our results point to the need for novel TLB designs encouraging inter-core cooperation, either through hierarchically shared TLBs or through inter-core TLB prediction mechanisms.

[1]  Douglas W. Clark,et al.  Performance of the VAX-11/780 translation buffer: simulation and measurement , 1985, TOCS.

[2]  Brian N. Bershad,et al.  The interaction of architecture and operating system design , 1991, ASPLOS IV.

[3]  Brian N. Bershad,et al.  The interaction of architecture and operating system design , 1991, ASPLOS IV.

[4]  Norman P. Jouppi,et al.  A simulation based study of TLB performance , 1992, ISCA '92.

[5]  Trevor N. Mudge,et al.  Design Tradeoffs For Software-managed Tlbs , 1994, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[6]  Jerry Huck,et al.  Architectural support for translation table management in large address space machines , 1993, ISCA '93.

[7]  Mark D. Hill,et al.  Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.

[8]  Mark D. Hill,et al.  Use of superpages and subblocking in the address translation hierarchy , 1995 .

[9]  Anoop Gupta,et al.  The impact of architectural trends on operating system performance , 1995, SOSP.

[10]  Trevor N. Mudge,et al.  Software-managed address translation , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[11]  Trevor N. Mudge,et al.  A look at several memory management units, TLB-refill mechanisms, and page table organizations , 1998, ASPLOS VIII.

[12]  Trevor N. Mudge,et al.  Virtual memory in contemporary microprocessors , 1998, IEEE Micro.

[13]  Michel Dubois,et al.  Options for dynamic address translation in COMAs , 1998, ISCA.

[14]  Per Stenström,et al.  Recency-based TLB preloading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[15]  Anand Sivasubramaniam,et al.  Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks , 2002, SIGMETRICS '02.

[16]  Anand Sivasubramaniam,et al.  Going the distance for TLB prefetching: an application-driven study , 2002, ISCA.

[17]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).