Compendia: reducing virtual-memory costs via selective densification

Virtual-to-physical address translation is becoming an increasingly dominant cost in workload execution: as data sizes scale, each translation can require up to four memory accesses on a native system, and up to 24 on a virtualised one. However, the radix trees used today to hold these translations have many favorable properties, including cacheability, the ability to fit in conventional 4 KiB page frames, and a sparse representation. They are therefore unlikely to be replaced in the near future. In this paper we argue that these structures are in fact too sparse for modern workloads, so many of their overheads are unnecessary. Instead, where appropriate, we merge groups of 4 KiB layers, each able to translate 9 bits of the address, into a single 2 MiB layer that translates 18 bits in one memory access. These larger layers fit in the standard huge-page allocations used by most conventional operating systems and architectures. With minor extensions to the page-table-walker structures to support these layers and aid their cacheability, we can reduce memory accesses per walk by 27%, or by 56% for virtualised systems, without significant memory overhead.
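
The figures quoted above follow from simple arithmetic, sketched here under stated assumptions: page-table entries are taken to be 8 bytes (as on x86-64, which the abstract does not state), and the virtualised worst case uses the standard two-dimensional walk formula for a four-level guest table over a four-level host table. The function names are hypothetical and chosen for illustration. The sketch counts worst-case accesses with no walk caches, so it does not reproduce the measured 27% and 56% reductions.

```python
PTE_BYTES = 8  # assumption: 8-byte page-table entries, as on x86-64


def index_bits(table_bytes: int, pte_bytes: int = PTE_BYTES) -> int:
    """Address bits translated by one table of the given size."""
    entries = table_bytes // pte_bytes
    # entries is a power of two here, so bit_length() - 1 is an exact log2
    return entries.bit_length() - 1


def nested_walk_accesses(guest_levels: int, host_levels: int) -> int:
    """Worst-case memory accesses for a two-dimensional (nested) walk.

    Each guest level, plus the final guest-physical address, needs a full
    host walk; each guest level also needs one read of the guest entry.
    """
    return (guest_levels + 1) * (host_levels + 1) - 1


KIB = 1024
MIB = 1024 * KIB

print(index_bits(4 * KIB))         # 9 bits per conventional 4 KiB layer
print(index_bits(2 * MIB))         # 18 bits per merged 2 MiB layer
print(nested_walk_accesses(4, 4))  # 24 accesses, four-level guest and host
```

Merging two adjacent 4 KiB layers into one 2 MiB layer removes one level from the walk, so a four-level native walk becomes three accesses in this uncached model.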
