Enhancing and Exploiting Contiguity for Fast Memory Virtualization

We propose synergistic software and hardware mechanisms that alleviate the address translation overhead, focusing particularly on virtualized execution. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging applies to the hypervisor and guest OS memory manager independently, as well as to native systems. Moreover, CA paging benefits any address translation scheme that leverages contiguous mappings. On the hardware side, we propose SpOT, a simple micro-architectural mechanism to hide TLB miss latency by exploiting the regularity of large contiguous mappings to predict address translations in both native and virtualized systems. We implement and emulate the proposed techniques for the x86-64 architecture in Linux and KVM, and evaluate them across a variety of memory-intensive workloads. Our results show that: (i) CA paging is highly effective at creating vast contiguous mappings, even when memory is fragmented, and (ii) SpOT exploits the created contiguity and reduces address translation overhead of nested paging from ~16.5% to ~0.9%.

[1]  Zi Yan,et al.  Translation Ranger: Operating System Support for Contiguity-Aware TLBs , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[2]  Abhishek Bhattacharjee Preserving Virtual Memory by Mitigating the Address Translation Wall , 2017, IEEE Micro.

[3]  Aamer Jaleel,et al.  CoLT: Coalesced Large-Reach TLBs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[4]  Ján Veselý,et al.  Hardware translation coherence for virtualized systems , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[5]  Alan L. Cox,et al.  Practical, transparent operating system support for superpages , 2002, OPSR.

[6]  Michael M. Swift,et al.  Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[7]  Josep Torrellas,et al.  Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism , 2020, ASPLOS.

[8]  Frank Piessens,et al.  A Systematic Evaluation of Transient Execution Attacks and Defenses , 2018, USENIX Security Symposium.

[9]  Alan L. Cox,et al.  Translation caching: skip, don't walk (the page table) , 2010, ISCA.

[10]  Nadav Amit,et al.  Don't shoot down TLB shootdowns! , 2020, EuroSys.

[11]  Nael B. Abu-Ghazaleh,et al.  SafeSpec: Banishing the Spectre of a Meltdown with Leakage-Free Speculation , 2018, 2019 56th ACM/IEEE Design Automation Conference (DAC).

[12]  K. Gopinath,et al.  HawkEye: Efficient Fine-grained OS Support for Huge Pages , 2019, ASPLOS.

[13]  David H. Bailey,et al.  The NAS parallel benchmarks summary and preliminary results , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[14]  Abhishek Bhattacharjee,et al.  Large-reach memory management unit caches , 2013, MICRO.

[15]  Xin Tong,et al.  Prediction-based superpage-friendly TLB designs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[16]  Alex Delis,et al.  MEGA: overcoming traditional problems with OS huge page management , 2019, SYSTOR.

[17]  Lizy Kurian John,et al.  CSALT: Context Switch Aware Large TLB* , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18]  Michael Hamburg,et al.  Meltdown: Reading Kernel Memory from User Space , 2018, USENIX Security Symposium.

[19]  Josep Torrellas,et al.  InvisiSpec: Making Speculative Execution Invisible in the Cache Hierarchy , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  Margaret Martonosi,et al.  Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[21]  Michael M. Swift,et al.  Efficient virtual memory for big memory servers , 2013, ISCA.

[22]  Wolfgang Rehm,et al.  Paging Method Switching for QEMU-KVM Guest Machine , 2014, BigDataScience '14.

[23]  Mohan Kumar,et al.  LATR: Lazy Translation Coherence , 2018, ASPLOS.

[24]  Gregory T. Byrd,et al.  Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior , 2019, ICS.

[25]  Tianhao Zhang,et al.  Do-it-yourself virtual memory translation , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[26]  Mahmut T. Kandemir,et al.  Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[27]  Abhishek Bhattacharjee,et al.  Translation-Triggered Prefetching , 2017, ASPLOS.

[28]  Guy E. Blelloch,et al.  Ligra: a lightweight graph processing framework for shared memory , 2013, PPoPP '13.

[29]  Abhishek Bhattacharjee,et al.  Efficient Address Translation for Architectures with Multiple Page Sizes , 2017, ASPLOS.

[30]  Lixin Zhang,et al.  Enigma: architectural and operating system support for reducing the impact of address translation , 2010, ICS '10.

[31]  Jure Leskovec,et al.  {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .

[32]  Leigh Stoller,et al.  Increasing TLB reach using superpages backed by shadow memory , 1998, ISCA.

[33]  K. Gopinath,et al.  Making Huge Pages Actually Useful , 2018, ASPLOS.

[34]  Avi Mendelson,et al.  DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[35]  Margaret Martonosi,et al.  Inter-core cooperative TLB for chip multiprocessors , 2010, ASPLOS XV.

[36]  Osman S. Unsal,et al.  Redundant Memory Mappings for fast access to large memories , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[37]  Yan Solihin,et al.  Avoiding TLB Shootdowns Through Self-Invalidating TLB Entries , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[38]  Rami G. Melhem,et al.  Supporting superpages in non-contiguous physical memory , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[39]  Herbert Bos,et al.  Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks , 2018, USENIX Security Symposium.

[40]  Per Stenström,et al.  Recency-based TLB preloading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[41]  Andrew Siegel,et al.  XSBENCH - THE DEVELOPMENT AND VERIFICATION OF A PERFORMANCE ABSTRACTION FOR MONTE CARLO REACTOR ANALYSIS , 2014 .

[42]  Gabriel H. Loh,et al.  Increasing TLB reach by exploiting clustering in page translations , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[43]  Herbert Bos,et al.  Malicious Management Unit: Why Stopping Cache Attacks in Software is Harder Than You Think , 2018, USENIX Security Symposium.

[44]  H. Reza Taheri,et al.  Performance Implications of Extended Page Tables on Virtualized x86 Processors , 2016, VEE.

[45]  Daniel J. Sorin,et al.  UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[46]  David Black-Schaffer,et al.  Perforated Page: Supporting Fragmented Memory Allocation for Large Pages , 2020, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).

[47]  Zi Yan,et al.  Nimble Page Management for Tiered Memory Systems , 2019, ASPLOS.

[48]  Jee Ho Ryoo,et al.  Rethinking TLB designs in virtualized environments: A very large part-of-memory TLB , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[49]  Zhen Fang,et al.  Reevaluating online superpage promotion with hardware support , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[50]  Boris Grot,et al.  Prefetched Address Translation , 2019, MICRO.

[51]  Patrick Healy,et al.  Supporting superpage allocation without additional hardware support , 2008, ISMM '08.

[52]  Alan L. Cox,et al.  SpecTLB: A mechanism for speculative address translation , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[53]  Michael Hamburg,et al.  Spectre Attacks: Exploiting Speculative Execution , 2018, 2019 IEEE Symposium on Security and Privacy (SP).

[54]  Jaehyuk Huh,et al.  Hybrid TLB coalescing: Improving TLB translation coverage under diverse fragmented memory allocations , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[55]  Jack J. Dongarra,et al.  Collecting Performance Data with PAPI-C , 2009, Parallel Tools Workshop.

[56]  Mattan Erez,et al.  SIPT: Speculatively Indexed, Physically Tagged Caches , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[57]  Srilatha Manne,et al.  Accelerating two-dimensional page walks for virtualized systems , 2008, ASPLOS.

[58]  Youngjin Kwon,et al.  Coordinated and Efficient Huge Page Management with Ingens , 2016, OSDI.

[59]  Anand Sivasubramaniam,et al.  Going the distance for TLB prefetching: an application-driven study , 2002, ISCA.

[60]  Nadav Amit,et al.  Optimizing the TLB Shootdown Algorithm with Page Access Tracking , 2017, USENIX Annual Technical Conference.

[61]  Ján Veselý,et al.  Large pages and lightweight memory management in virtualized environments: Can you have it both ways? , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[62]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[63]  Michael M. Swift,et al.  BadgerTrap: a tool to instrument x86-64 TLB misses , 2014, CARN.

[64]  Narayanan Ganapathy,et al.  General Purpose Operating System Support for Multiple Page Sizes , 1998, USENIX Annual Technical Conference.

[65]  Dan Tsafrir,et al.  Hash, Don't Cache (the Page Table) , 2016, SIGMETRICS.

[66]  Timothy Roscoe,et al.  Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines , 2019, ASPLOS.

[67]  Michael M. Swift,et al.  Agile Paging: Exceeding the Best of Nested and Shadow Paging , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[68]  Yingwei Luo,et al.  Selective hardware/software memory virtualization , 2011, VEE '11.

[69]  Jaehyuk Huh,et al.  Revisiting hardware-assisted page walks for virtualized systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[70]  Mark D. Hill,et al.  Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.

[71]  Michael M. Swift,et al.  Devirtualizing Memory in Heterogeneous Systems , 2018, ASPLOS.