Translation caching: skip, don't walk (the page table)

This paper explores the design space of MMU caches that accelerate virtual-to-physical address translation in processor architectures, such as x86-64, that use a radix tree page table. In particular, these caches accelerate the page table walk that occurs after a miss in the Translation Lookaside Buffer (TLB). This paper shows that the most effective MMU caches are translation caches, which store partial translations and allow the page walk hardware to skip one or more levels of the page table. In recent years, both AMD and Intel processors have implemented MMU caches. However, their implementations are quite different and represent distinct points in the design space. This paper introduces three new MMU cache structures that round out the design space and directly compares the effectiveness of all five organizations. This comparison shows that two of the newly introduced structures, both of which are translation cache variants, are better than existing structures in many situations. Finally, this paper contributes to the age-old discourse concerning the relative effectiveness of different page table organizations. Generally speaking, earlier studies concluded that organizations based on hashing, such as the inverted page table, outperformed organizations based on radix trees for supporting large virtual address spaces. However, these studies did not take into account the possibility of caching page table entries from the higher levels of the radix tree. This paper shows that any of the five MMU cache structures reduces the DRAM accesses required for radix tree page table walks far below the number required by an inverted page table.
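The core idea above can be illustrated with a small simulation. The sketch below (the class and function names are illustrative, not from the paper) models a four-level x86-64-style radix walk plus a translation cache keyed by virtual-page-number prefixes: on a hit, the walker resumes one or more levels down the tree instead of starting at the root, so a warm walk touches fewer page-table nodes (i.e., performs fewer DRAM accesses).

```python
# Minimal sketch of a translation cache for a 4-level radix page table.
# Assumption: names (TranslationCache, walk, split_vpn) are illustrative.

# x86-64 splits the 36-bit virtual page number into four 9-bit indices,
# one per page-table level (the root level is resolved first).
LEVELS = 4
BITS_PER_LEVEL = 9

def split_vpn(vpn):
    """Return the four 9-bit indices, root level first."""
    return [(vpn >> (BITS_PER_LEVEL * i)) & 0x1FF
            for i in range(LEVELS - 1, -1, -1)]

class TranslationCache:
    """Maps a VPN prefix (the indices for levels already resolved) to the
    page-table node reached after those levels, so a later walk can skip
    straight to that node instead of re-walking from the root."""
    def __init__(self):
        self.entries = {}

    def lookup(self, indices):
        # Prefer the longest cached prefix: it skips the most levels.
        for depth in range(LEVELS - 1, 0, -1):
            node = self.entries.get(tuple(indices[:depth]))
            if node is not None:
                return depth, node
        return 0, None

    def insert(self, prefix, node):
        self.entries[tuple(prefix)] = node

def walk(root, vpn, cache):
    """Translate vpn; returns (pte, memory_accesses_performed)."""
    indices = split_vpn(vpn)
    depth, node = cache.lookup(indices)
    if node is None:
        node = root
    accesses = 0
    for level in range(depth, LEVELS):
        node = node[indices[level]]  # one memory access per level walked
        accesses += 1
        if level < LEVELS - 1:       # cache every intermediate node
            cache.insert(indices[:level + 1], node)
    return node, accesses
```

For example, with a toy page table holding one mapped page, the first (cold) walk performs four accesses and fills the cache; a second walk of the same address hits on the longest prefix and performs only one access, for the leaf PTE itself:

```python
l1 = {0x1AB: "PTE"}
l2 = {0x0CD: l1}
l3 = {0x042: l2}
root = {0x005: l3}
vpn = (0x005 << 27) | (0x042 << 18) | (0x0CD << 9) | 0x1AB

cache = TranslationCache()
pte, n = walk(root, vpn, cache)    # cold walk: 4 accesses
pte2, n2 = walk(root, vpn, cache)  # warm walk: 1 access (skips 3 levels)
```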
