Paper information - Revisiting virtual memory

Revisiting virtual memory

Page-based virtual memory (paging) is a crucial piece of memory management in today's computing systems. However, I find that the need, purpose, and design constraints of virtual memory have changed dramatically since translation lookaside buffers (TLBs) were introduced to cache recently used address translations: (a) physical memory sizes have grown more than a million-fold, (b) workloads are often sized to avoid swapping information to and from secondary storage, and (c) energy is now a first-order design constraint. Nevertheless, level-one TLBs have remained the same size and are still accessed on every memory reference. As a result, large workloads waste considerable execution time on TLB misses, and all workloads spend energy on frequent TLB accesses. In this thesis I argue that it is now time to reevaluate virtual memory management. I reexamine the virtual memory subsystem in light of the ever-growing latency overhead of address translation and of energy dissipation, developing three results.

First, I propose direct segments to reduce the latency overhead of address translation for emerging big-memory workloads. Many big-memory workloads allocate most of their memory early in execution and do not benefit from paging. Direct segments enable hardware-OS mechanisms to bypass paging for a part of a process's virtual address space, eliminating nearly 99% of TLB misses for many of these workloads.

Second, I propose opportunistic virtual caching (OVC) to reduce the energy spent on translating addresses. Accessing TLBs on each memory reference burns significant energy, and virtual memory's page size constrains L1-cache designs to be highly associative, burning yet more energy. OVC makes hardware-OS modifications to expose energy-efficient virtual caching as a dynamic optimization. This saves 94-99% of TLB lookup energy and 23% of L1-cache lookup energy across several workloads.

Third, large pages are likely to be more appropriate than direct segments for reducing TLB misses under frequent memory allocations and deallocations. Unfortunately, prevalent chip designs, such as Intel's, statically partition TLB resources among multiple page sizes, which can lead to performance pathologies when using large pages. I propose the merged-associative TLB to avoid such pathologies and reduce the TLB miss rate by up to 45% through dynamic aggregation of TLB resources across page sizes.
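The abstract's remark that page size forces L1 caches to be highly associative can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the thesis: it assumes a virtually indexed, physically tagged L1 cache whose set-index bits must all lie within the page offset, so that each way can be no larger than one page. The function name `min_associativity` and the example sizes are hypothetical.

```python
def min_associativity(cache_bytes: int, page_bytes: int) -> int:
    """Smallest number of ways such that one way fits within one page.

    Assumption (not from the source): the L1 is virtually indexed and
    physically tagged, so cache_bytes / ways must not exceed page_bytes.
    """
    if cache_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: ways >= cache_bytes / page_bytes, and at least 1.
    return max(1, -(-cache_bytes // page_bytes))


if __name__ == "__main__":
    KIB = 1024
    MIB = 1024 * KIB
    # A 32 KiB L1 with 4 KiB pages needs at least 8 ways under this assumption.
    print(min_associativity(32 * KIB, 4 * KIB))
    # With 2 MiB pages the constraint disappears; a direct-mapped cache would do.
    print(min_associativity(32 * KIB, 2 * MIB))
```

Under these assumptions, growing the L1 while keeping the base page size fixed pushes the required associativity up, and every extra way adds tag comparisons per lookup. That is the energy cost the second result targets.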
