Architectural and Operating System Support for Virtual Memory

This book provides computer engineers, academic researchers, new graduate students, and seasoned practitioners with an end-to-end overview of virtual memory. We begin with a recap of foundational concepts and discuss not only state-of-the-art virtual memory hardware and software support available today, but also emerging research trends in this space. The span of topics covers processor microarchitecture, memory systems, operating system design, and memory allocation. We show how efficient virtual memory implementations hinge on careful hardware and software cooperation, and we discuss new research directions aimed at addressing emerging problems in this space. Virtual memory is a classic computer science abstraction and one of the pillars of the computing revolution. It has long enabled hardware flexibility, software portability, and better overall security, to name just a few of its powerful benefits. Nearly all user-level programs today take for granted that they have been freed from the burden of physical memory management by the hardware, the operating system, device drivers, and system libraries. However, despite its ubiquity in systems ranging from warehouse-scale datacenters to embedded Internet of Things (IoT) devices, the overheads of virtual memory are becoming a critical performance bottleneck today. Virtual memory architectures designed for individual CPUs or even individual cores are in many cases struggling to scale up and scale out to today's systems, which increasingly include exotic hardware accelerators (such as GPUs, FPGAs, or DSPs) and emerging memory technologies (such as non-volatile memory), and which run increasingly intensive workloads (such as virtualized and/or "big data" applications). As such, many of the fundamental abstractions and implementation approaches for virtual memory are being augmented, extended, or entirely rebuilt in order to ensure that virtual memory remains viable and performant in the years to come.
