Guarded page tables on MIPS R4600

The introduction of 64-bit microprocessors has increased the demands placed on virtual memory systems. The availability of large address spaces has led to a flurry of new applications and operating systems that further stress virtual memory systems. Consequently, much interest has recently focussed on translation lookaside buffer (TLB) performance and page table efficiency. Guarded Page Tables are a mechanism for overcoming some of the problems associated with conventional page tables. Like conventional page tables, they are tree structured and therefore support hierarchical operations and sharing of sub-trees. Unlike conventional page tables, guarded page tables implement huge, sparsely occupied address spaces efficiently. We describe guarded page tables and the associated parsing algorithm, and present R4600 processor-dependent micro-optimisations. R4600 TLB refill is discussed in detail, including a comparison of guarded page tables with more conventional page tables. A software second-level TLB is introduced and analysed as a way of increasing guarded page table performance.

1 Rationale

The advent of generally available 64-bit machines like the R4600[7] and DEC Alpha[14] has led researchers to propose new ways of using virtual memory that were previously restricted by virtual address space size. Single address space systems such as Angel[12], Mungi[5], Nemesis[4], and Opal[2], and persistent operating systems such as Grasshopper[3], resulted from the availability of wide address spaces. These systems, other proposals[1], and large UNIX-style address spaces have increased VM demands in comparison to the traditional UNIX virtual memory model that modern processors are designed to support. This has generated much interest in how best to support 64-bit address spaces[17, 16, 6, 15, 13]. This technical report explores the implementation of one proposed mechanism: Guarded Page Tables.
This work originated as part of the Mungi[5] project at UNSW. Since Mungi makes heavy use of a sparsely occupied address space, its VM system must support sparsity efficiently. We selected the Guarded Page Table (GPT) mechanism (see section 2), which combines the advantages of multi-level and inverted page tables. The critical point was whether the GPT mechanism could be implemented efficiently on the R4600 processor. Therefore, we developed R4600-specific GPT parsing algorithms (section 3) and complemented them with a second-level software TLB (section 5). How best to combine these elements depends on both the concrete memory system (cache and memory timing) and the TLB-miss characteristics of the OS and applications. Therefore, we include a detailed performance discussion and make the software available as a tool box.

The purpose of this technical report is threefold:

1. It offers a tool box for experimenting with Guarded Page Tables on the R4600 processor.
2. It can be used as a guide for implementing Guarded Page Tables on other processors that support software-controlled TLBs.
3. Independent of the concrete problem, section 3 can serve as an example of architecture-dependent micro-optimisation.

An interesting result is that about 2/3 of the optimisation process, though architecture-dependent, can be carried out in terms of a high-level language and is based on algorithmic and data structure optimisations. The example shows that substantial performance gains (factors of 2.5 or more) are achievable by combining this method with specific assembler-level optimisations where general automatic code optimisation techniques do not help.

2 Guarded Page Tables

Guarded Page Tables have been described in [9, 10]. They combine the advantages of tree-structured multi-level page tables and hashed page tables: unlimited sparsity (2 page table entries per mapped page are always sufficient), tree structure (subtree sharing, hierarchical operations) and multiple page sizes.
These properties are described in more detail in [8, 11]. Here we give only a short sketch of the basic mechanism.

The main problem with multi-level page tables is sparsity: we need huge amounts of page table entries for non-mapped pages. In the following example, the mapping of page 11 10 11 00 in a sparsely occupied address space is shown. (For demonstration purposes we use very small addresses and small page tables. Nil pointers are marked by a dedicated symbol.) The second- and third-level page tables are extremely sparse: each contains one single non-nil entry. Consequently, there is only one valid path through these two tables: when the leftmost two bits are "11", the subsequent address bits must be "10 11"; all other addresses lead to page faults. As shown in Figure 1, we can omit the two page tables and skip the associated translation steps.

[Figure 1: Guarded Page Tables. The address v = 11 10 11 00 xxx selects entry 11 of the top-level page table, whose guard 10 11 leads directly to the level-4 page table and from there to the data page.]

Whenever entry 3 of the top-level page table is reached, we have to check whether "10 11" is a prefix of the remaining address. If so, this prefix can be stripped off, and the translation process can directly continue at the level-4 page table. Therefore, each entry is augmented with a bit string g of variable length, which is referred to as a guard. This is the key idea of guarded page tables.

The translation process works as follows: first, upon each transformation step, a page table entry is selected by the highest part of the virtual address in the same way as in the conventional multi-level page table method. The selected entry, however, contains not only a pointer (and perhaps an access attribute) but also the guard g. If g is a prefix of the remaining virtual address, the translation process either continues with the remaining postfix or terminates with the postfix as page offset. As an example, Figure 2 presents the transformation of 20 address bits by 3 page tables.
[Figure 2: Guarded Page Table Tree. The 20-bit address v = 0 | 1100101 | 100101100111 is translated in three steps:
    2-entry table, index 0, guard g = 1100101, leaving v' = 10 | 0101 | 100111;
    4-entry table, index 10, guard g = 0101, leaving v'' = 1 | 0 | 0111;
    2-entry table, index 1, guard g = 0, leaving offset = 0111 into the data page.]

Note that the length of the guards may vary from entry to entry. Furthermore, page table sizes can be mixed; all powers of 2 are admissible. The same holds for data pages, i.e., a mixture of 2-, 4-, ..., 1024-, ... entry page tables and pages can be used.

Guarded page tables contain conventional tables as a special case: if a guard has length zero, a translation step works exactly as in the conventional mechanism. However, in all cases that conventionally require a table with only one valid entry, a guard can be used instead. It can even replace a sequence of such "single-entry" page tables. This saves both memory capacity and transformation steps, i.e., guards act as a shortcut.

3 GPT Parser

At first, we describe a GPT translation step in general, independent of concrete hardware (see Figure 3). Here, v is the part of the original virtual address that is still subject to translation, and the pair (p, s) determines the page table (p: physical address, s: log2 of table size) that has to be used for the current translation step. The result of this step is either a new page table (p', s') and a postfix v' of v, or the data page (p', s') and offset v'.

The translation step starts by extracting u, the uppermost s bits of v. u is used for indexing the page table. The addressed entry specifies a guard g of variable size (possibly empty), which is checked against the corresponding following bits w of the virtual address (w = g). When they are equal, the remaining v' is used either for the next-level translation or as the offset part. This operates as a shortcut, since not only u, but both u and w are stripped off the virtual address in one step; no table is necessary to decode w.
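The translation step described above can be sketched in C as follows. This is a minimal illustration, not the R4600 implementation developed in this report: the `gpt_entry` layout, its field names, the `valid` flag, and the convention that a NULL table pointer marks a data page are all assumptions made for the example. The static tables reproduce the three-table tree of Figure 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout for illustration only (not the report's format). */
typedef struct gpt_entry {
    uint64_t guard;                 /* guard bits g, right-aligned            */
    unsigned glen;                  /* |g|: guard length in bits              */
    const struct gpt_entry *table;  /* next-level table; NULL = data page     */
    unsigned size;                  /* s': log2 of the next table's size      */
    int valid;                      /* 0 = nil entry (page fault)             */
} gpt_entry;

/* Translate the vlen-bit string v, starting at table p with 2^s entries.
 * Returns 0 and stores the page offset and its bit length on success,
 * or -1 on a page fault. Assumes vlen < 64 and v < 2^vlen. */
static int gpt_translate(const gpt_entry *p, unsigned s,
                         uint64_t v, unsigned vlen,
                         uint64_t *offset, unsigned *offlen)
{
    assert(vlen < 64);
    for (;;) {
        if (vlen < s)
            return -1;                        /* not enough bits left to index */
        uint64_t u = v >> (vlen - s);         /* uppermost s bits select entry */
        const gpt_entry *e = &p[u];
        if (!e->valid || vlen - s < e->glen)
            return -1;
        unsigned rest = vlen - s - e->glen;   /* length of v'                  */
        uint64_t w = (v >> rest) & ((UINT64_C(1) << e->glen) - 1);
        if (w != e->guard)
            return -1;                        /* guard mismatch: page fault    */
        v &= (UINT64_C(1) << rest) - 1;       /* strip u and w in one step     */
        vlen = rest;
        if (e->table == NULL) {               /* reached a data page           */
            *offset = v;
            *offlen = vlen;
            return 0;
        }
        p = e->table;
        s = e->size;
    }
}

/* The tree of Figure 2; unlisted entries are zero-initialised, i.e. nil. */
static const gpt_entry level3[2] = {
    [1] = { .guard = 0x0, .glen = 1, .table = NULL, .size = 0, .valid = 1 },
};
static const gpt_entry level2[4] = {
    [2] = { .guard = 0x5, .glen = 4, .table = level3, .size = 1, .valid = 1 },
};
static const gpt_entry top[2] = {
    [0] = { .guard = 0x65, .glen = 7, .table = level2, .size = 2, .valid = 1 },
};
```

Calling `gpt_translate(top, 1, 0x65967, 20, &off, &len)` walks the Figure 2 address (0x65967 is 0 1100101 10 0101 1 0 0111 in binary) and yields the offset 0111 with 4 bits remaining; an address whose bits disagree with any guard on the path returns -1.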
Note that the width of u (determined by the page table size) may vary from step to step and that the size of w may differ from entry to entry.

[Figure 3: Guarded Translation Step. The table (p, s) is indexed by u, the upper part of v = u w v'; the selected entry supplies the guard g, which is compared with w, and the next pair (p', s').]

In the following, we use $|x|$ to denote the bit length of a flexible bit string x. For improved clarity, we always use x' for an item that belongs to the next translation step (i.e., refers to the next lower level page table) and x for an item belonging to the current level. Assuming at first 32-byte page table entries (later this is reduced to 16 bytes), one GPT translation step is as follows, where >> denotes a logical right shift:

    u := v >> (|v| - s);
    g := [p + 32u].guard;
    if g = ((v >> (|v| - s - |g|)) AND (2^{|g|} - 1)) then
        v' := v AND (2^{|v| - s - |g|} - 1);
        s' := [p + 32u].size';
        p' := [p + 32u].table';
    else page fault.

This algorithm cannot be implemented as is, because the R4600 processor does not support variable-length bit strings as a basic data type. Therefore, we have to hold $|v|$ and $|g|$ in additional variables vlen and glen:

    u := v >> (vlen - s);
    g := [p + 32u].guard;
    glen := [p + 32u].guard_len;
    if g = ((v >> (vlen - s - glen)) AND (2^{glen} - 1)) then
        vlen' := vlen - s - glen;
        v' := v AND (2^{vlen'} - 1);
        s' := [p + 32u].size';
        p' := [p + 32u].table';
    else page fault.

After eliminating common subexpressions, this algorithm requires 17 arithmetic and load operations.

3.1 From 17 To 10 Operations

Note that although v is an input variable of the translation process, the length $|v|$ is a constant which is determined by the depth of the table in the GPT tree. Furthermore, the table size s and the guard length $|g|$ are fixed per page table entry. So the values

$$s_0 = vlen - s, \qquad s_1 = vlen - s - glen, \qquad gmask = 2^{glen} - 1$$

with the meaning

    v  = u  | g  | v'      ($s_0$ is the length of g v',  $s_1$ is the length of v')
    v' = u' | g' | v''     ($s_0'$ is the length of g' v'', $s_1'$ is the length of v'')

can be computed when constructing a GPT entry and can be stored per entry.
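To make the precomputed values concrete, the following C sketch derives $s_0$, $s_1$ and gmask from vlen, s and glen, and then uses them to split an address without any further subtraction or mask construction. The struct `gpt_shifts` and the helper names are hypothetical, introduced only for this illustration; bundling all three values into one struct is a simplification and says nothing about where the report's entry format actually keeps them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the per-step constants (illustration only). */
typedef struct {
    unsigned s0;     /* vlen - s:        shift that isolates the index u   */
    unsigned s1;     /* vlen - s - glen: shift that drops v'               */
    uint64_t gmask;  /* 2^glen - 1:      mask that isolates the guard bits */
} gpt_shifts;

/* Computed once, when the entry is constructed. Assumes vlen < 64. */
static gpt_shifts gpt_precompute(unsigned vlen, unsigned s, unsigned glen)
{
    assert(vlen < 64 && s + glen <= vlen);
    gpt_shifts k;
    k.s0 = vlen - s;
    k.s1 = vlen - s - glen;
    k.gmask = (UINT64_C(1) << glen) - 1;
    return k;
}

/* u: the uppermost s bits of v. */
static uint64_t gpt_index(uint64_t v, gpt_shifts k)
{
    return v >> k.s0;
}

/* w: the bits of v that must equal the guard g. */
static uint64_t gpt_guard_bits(uint64_t v, gpt_shifts k)
{
    return (v >> k.s1) & k.gmask;
}

/* v': what remains after u and w have been stripped off. */
static uint64_t gpt_rest(uint64_t v, gpt_shifts k)
{
    return v & ((UINT64_C(1) << k.s1) - 1);
}
```

For the top-level step of Figure 2 (vlen = 20, s = 1, glen = 7) this gives $s_0 = 19$, $s_1 = 12$ and gmask = 0x7f; applied to the address 0x65967 the helpers return index 0, guard bits 1100101 and the 12-bit remainder 0x967.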
Note that we have to store the present level's $s_1$ and the next level's $s_0'$ in a page table entry:

    guard | s_1 | s_0' | table'

Fortunately, $s_0'$ can be determined as easily as $s_0$, since $s_0' = vlen' - s' = vlen - s - glen - s' = s_1 - s'$. The improved algorithm

    u := v >> s_0;
    g := [p + 32u].guard;
    gmask := [p + 32u].gmask;
    s_1 := [p + 32u].s_1;
    if g = ((v >> s_1) AND gmask) then
        v' := v AND (2^{s_1} - 1);
        s_0' := [p + 32u].s_0';
        p' := [p + 32u].table';
    else page fault.

requires only 14 arithmetic/load ope