Revisiting virtual memory

Page-based virtual memory (paging) is a crucial piece of memory management in today's computing systems. However, I find that need, purpose and design constraints of virtual memory have changed dramatically since translation lookaside buffers (TLBs) were introduced to cache recently-used address translations: (a) physical memory sizes have grown more than a million-fold, (b) workloads are often sized to avoid swapping information to and from secondary storage, and (c) energy is now a first-order design constraint. Nevertheless, level-one TLBs have remained the same size and are still accessed on every memory reference. As a result, large workloads waste considerable execution time on TLB misses and all workloads spend energy on frequent TLB accesses. In this thesis I argue that it is now time to reevaluate virtual memory management. I reexamine virtual memory subsystem considering the ever-growing latency overhead of address translation and considering energy dissipation, developing three results. First, I proposed direct segments to reduce the latency overhead of address translation for emerging big-memory workloads. Many big-memory workloads allocate most of their memory early in execution and do not benefit from paging. Direct segments enable hardware-OS mechanisms to bypass paging for a part of a process's virtual address space, eliminating nearly 99% of TLB miss for many of these workloads. Second, I proposed opportunistic virtual caching (OVC) to reduce the energy spent on translating addresses. Accessing TLBs on each memory reference burns significant energy, and virtual memory's page size constrains L1-cache designs to be highly associative—burning yet more energy. OVC makes hardware-OS modifications to expose energy-efficient virtual caching as a dynamic optimization. This saves 94-99% of TLB lookup energy and 23% of L1-cache lookup energy across several workloads. Third, large pages are likely to be more appropriate than direct segments to reduce TLB misses under frequent memory allocations/deallocations. Unfortunately, prevalent chip designs like Intel's, statically partition TLB resources among multiple page sizes, which can lead to performance pathologies for using large pages. I proposed the merged-associative TLB to avoid such pathologies and reduce TLB miss rate by up to 45% through dynamic aggregation of TLB resources across page sizes.

[1]  Trevor N. Mudge,et al.  Uniprocessor Virtual Memory without TLBs , 2001, IEEE Trans. Computers.

[2]  Gürhan Küçük,et al.  Reducing reorder buffer complexity through selective operand caching , 2003, ISLPED '03.

[3]  Martín Abadi,et al.  An Overview of the Singularity Project , 2005 .

[4]  Balaram Sinharoy,et al.  IBM POWER7 multicore server processor , 2011 .

[5]  Csaba Andras Moritz,et al.  Cool-Mem: combining statically speculative memory accessing with selective address translation for energy efficiency , 2002, ASPLOS X.

[6]  Kevin Skadron,et al.  Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[7]  Rajiv Gupta,et al.  Efficient sequential consistency via conflict ordering , 2012, ASPLOS XVII.

[8]  Aamer Jaleel,et al.  CoLT: Coalesced Large-Reach TLBs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[9]  K. Diefendorff,et al.  Evolution of the PowerPC architecture , 1994, IEEE Micro.

[10]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[11]  Sang Lyul Min,et al.  U-cache: a cost-effective solution to synonym problem , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[12]  Gabriel H. Loh,et al.  Thermal analysis of a 3D die-stacked high-performance microprocessor , 2006, GLSVLSI '06.

[13]  Alan L. Cox,et al.  Protection Strategies for Direct Access to Virtualized I/O Devices , 2008, USENIX Annual Technical Conference.

[14]  C. Hansen,et al.  Table 2 , 2002, Equality and Non-Discrimination under the European Convention on Human Rights.

[15]  Bianca Schroeder,et al.  Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.

[16]  Cameron McNairy,et al.  Itanium 2 Processor Microarchitecture , 2003, IEEE Micro.

[17]  Douglas W. Clark,et al.  A Characterization of Processor Performance in the vax-11/780 , 1984, ISCA '84.

[18]  Xiangrong Zhou,et al.  Heterogeneously tagged caches for low-power embedded systems with virtual memory support , 2008, TODE.

[19]  Parthasarathy Ranganathan,et al.  From Microprocessors to Nanostores: Rethinking Data-Centric Systems , 2011, Computer.

[20]  Michel Cekleov,et al.  Virtual-address caches. Part 1: problems and solutions in uniprocessors , 1997, IEEE Micro.

[21]  Jeffrey S. Chase,et al.  Lightweight shared objects in a 64-bit operating system , 1992, OOPSLA 1992.

[22]  Uri C. Weiser,et al.  Proceedings of the 37th annual international symposium on Computer architecture , 2010, ISCA 2010.

[23]  Charles C. Weems,et al.  Selective block buffering TLB system for embedded processors , 2005 .

[24]  Anand Sivasubramaniam,et al.  Going the distance for TLB prefetching: an application-driven study , 2002, ISCA.

[25]  W. H. Wang,et al.  Organization and performance of a two-level virtual-real cache hierarchy , 1989, ISCA '89.

[26]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[27]  Ole Agesen,et al.  A comparison of software and hardware techniques for x86 virtualization , 2006, ASPLOS XII.

[28]  Parag Agrawal,et al.  The case for RAMCloud , 2011, Commun. ACM.

[29]  Michel Dubois,et al.  Virtual-address caches.2. Multiprocessor issues , 1997, IEEE Micro.

[30]  David E. Culler,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation , 2022 .

[31]  Sriram Sankar,et al.  Server Engineering Insights for Large-Scale Online Services , 2010, IEEE Micro.

[32]  Yen-Jen Chang,et al.  Two New Techniques Integrated for Energy-Efficient TLB Design , 2007, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[33]  James R. Goodman Coherency for multiprocessor virtual address caches , 1987, ASPLOS 1987.

[34]  Alan L. Cox,et al.  Practical, transparent operating system support for superpages , 2002, OPSR.

[35]  Dong Tang,et al.  Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[36]  Tom Kilburn,et al.  One-Level Storage System , 1962, IRE Trans. Electron. Comput..

[37]  Dharma P. Agrawal,et al.  Proceedings of the 11th annual international symposium on Computer architecture , 1984 .

[38]  Trevor N. Mudge,et al.  Virtual memory in contemporary microprocessors , 1998, IEEE Micro.

[39]  Michael M. Swift,et al.  Efficient virtual memory for big memory servers , 2013, ISCA.

[40]  Gernot Heiser,et al.  Fast address-space switching on the StrongARM SA-1100 processor , 2000, Proceedings 5th Australasian Computer Architecture Conference. ACAC 2000 (Cat. No.PR00512).

[41]  Alan L. Cox,et al.  Translation caching: skip, don't walk (the page table) , 2010, ISCA.

[42]  Margaret Martonosi,et al.  Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[43]  Barton P. Miller,et al.  Virtual machine-provided context sensitive page mappings , 2008, VEE '08.

[44]  Allan Gottlieb Proceedings of the 19th Annual International Symposium on Computer Architecture. Gold Coast, Australia, May 1992 , 1992, ISCA.

[45]  Per Stenström,et al.  Recency-based TLB preloading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[46]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[47]  Mahmut T. Kandemir,et al.  Reducing Data TLB Power via Compiler-Directed Address Generation , 2007, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[48]  Kevin Skadron,et al.  Proceedings 29th Annual International Symposium on Computer Architecture , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[49]  Michael M. Swift,et al.  Mnemosyne: lightweight persistent memory , 2011, ASPLOS XVI.

[50]  Michael M. Swift,et al.  Reducing memory reference energy with opportunistic virtual caching , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[51]  Qing Yang,et al.  Proceedings of the 38th annual international symposium on Computer architecture , 2011, ISCA 2011.

[52]  Hsien-Hsin S. Lee,et al.  Reducing energy of virtual cache synonym lookup using bloom filters , 2006, CASES '06.

[53]  Carl A. Waldspurger,et al.  Memory resource management in VMware ESX server , 2002, OSDI '02.

[54]  Lixin Zhang,et al.  Enigma: architectural and operating system support for reducing the impact of address translation , 2010, ICS '10.

[55]  Hsien-Hsin S. Lee,et al.  Energy efficient D-TLB and data cache using semantic-aware multilateral partitioning , 2003, ISLPED '03.

[56]  Donald J. Patterson,et al.  Computer organization and design: the hardware-software interface (appendix a , 1993 .

[57]  Anand Sivasubramaniam,et al.  Generating physical addresses directly for saving instruction TLB energy , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[58]  Margaret Martonosi,et al.  Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[59]  Margaret Martonosi,et al.  TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs , 2013, TACO.

[60]  Margaret Martonosi,et al.  Inter-core cooperative TLB for chip multiprocessors , 2010, ASPLOS 2010.

[61]  Mark D. Hill,et al.  Tradeoffs in supporting two page sizes , 1992, ISCA '92.

[62]  Mateo Valero,et al.  Multiple-banked register file architectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[63]  Narayanan Ganapathy,et al.  General Purpose Operating System Support for Multiple Page Sizes , 1998, USENIX Annual Technical Conference.

[64]  Srilatha Manne,et al.  Accelerating two-dimensional page walks for virtualized systems , 2008, ASPLOS.

[65]  Peter J. Denning Virtual Memory , 1996, ACM Comput. Surv..

[66]  Norman P. Jouppi,et al.  A simulation based study of TLB performance , 1992, ISCA '92.

[67]  David A. Wood,et al.  An in-cache address translation mechanism , 1986, ISCA '86.

[68]  Alan L. Cox,et al.  SpecTLB: A mechanism for speculative address translation , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[69]  Peter J. Denning,et al.  The working set model for program behavior , 1968, CACM.

[70]  Jack B. Dennis,et al.  Virtual memory, processes, and sharing in Multics , 1967, CACM.

[71]  Michel Dubois,et al.  The Synonym Lookaside Buffer: A Solution to the Synonym Problem in Virtual Caches , 2008, IEEE Transactions on Computers.

[72]  Jaehyuk Huh,et al.  Revisiting hardware-assisted page walks for virtualized systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[73]  Mark D. Hill,et al.  Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.

[74]  Per Stenström,et al.  TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors , 2002, ISLPED '02.

[75]  Jeffrey S. Chase,et al.  Architecture support for single address space operating systems , 1992, ASPLOS V.

[76]  Tomás Lang,et al.  Reducing TLB power requirements , 1997, Proceedings of 1997 International Symposium on Low Power Electronics and Design.

[77]  Mahmut T. Kandemir,et al.  Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[78]  Collin McCurdy,et al.  Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors , 2008, ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software.

[79]  Randy H. Katz,et al.  Heterogeneity and dynamicity of clouds at scale: Google trace analysis , 2012, SoCC '12.

[80]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.