Efficient Memory Virtualization

Two important trends in computing are evident. First, computing is becoming more data-centric, where low-latency access to very large amounts of data is critical. Second, virtual machines are playing an increasingly critical role in server consolidation, security, and fault tolerance as substantial amounts of computing migrate to shared resources in cloud services. Since the software stack accesses data using virtual addresses, fast address translation is a prerequisite for efficient data-centric computation and for providing the benefits of virtualization to a wide range of applications. Unfortunately, the growth in physical memory sizes is exceeding the capabilities of paging, the most widely used virtual memory abstraction, which has worked for decades. This thesis addresses this challenge comprehensively, proposing a hardware/software co-design for fast address translation in both virtualized and native systems to meet the needs of a wide variety of big-memory workloads. It aims to achieve near-zero virtual memory overheads in both settings. First, we observe that the overheads of page-based virtual memory can increase drastically with virtual machines. We previously proposed direct segments, which use a form of contiguous allocation in memory along with paging to largely eliminate virtual memory overhead for big-memory workloads on unvirtualized hardware. However, direct segments
