Prefetched Address Translation

With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints commonly reach hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk -- a long-latency pointer chase through multiple levels of the in-memory radix-tree-based page table. Anticipating further growth in dataset sizes and its adverse effect on TLB hit rates, this work seeks to accelerate page walks while fully preserving existing virtual memory abstractions and mechanisms -- a must for software compatibility and generality. Our idea is to enable direct indexing into a given level of the page table, eliding the need to first fetch pointers from the preceding levels. A key contribution of our work is in showing that this can be done simply by ordering the pages containing the page table in physical memory to match the order of the virtual memory pages they map. Doing so enables direct indexing into the page table using base-plus-offset arithmetic. We introduce Address Translation with Prefetching (ASAP), a new approach for reducing the latency of address translation to a single access to the memory hierarchy. Upon a TLB miss, ASAP launches prefetches to the deeper levels of the page table, bypassing the preceding levels. These prefetches proceed concurrently with a conventional page walk, which observes a latency reduction due to prefetching while guaranteeing that only correctly predicted entries are consumed. ASAP requires minimal extensions to the OS and trivial microarchitectural support. Moreover, ASAP is fully legacy-preserving, requiring no modifications to the existing radix-tree-based page table, TLBs, or other software and hardware mechanisms for address translation. Our evaluation on a range of memory-intensive workloads shows that under SMT colocation, ASAP reduces page walk latency by an average of 25% (up to 42%) in native execution and by 45% (up to 55%) under virtualization.
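
To make the base-plus-offset indexing concrete, the following is a minimal sketch, assuming x86-64 4-level paging with 48-bit virtual addresses and hypothetical per-process base addresses (here named l4_base and l3_base; these names are illustrative and not taken from the paper) pointing to contiguously ordered leaf (PT) and second-from-leaf (PD) page-table pages. Under those assumptions, the physical address of the entry at each level follows directly from the virtual address, with no pointers fetched from the preceding levels:

#include <stdint.h>

/* Hypothetical per-process bases for the contiguously ordered page-table
 * levels (illustrative names, not part of the paper's interface). */
typedef struct {
    uint64_t l3_base;   /* physical base of the ordered PD pages  */
    uint64_t l4_base;   /* physical base of the ordered PT pages  */
} asap_bases_t;

/* Direct index into the leaf (PT) level: the leaf PTE for a virtual
 * address is selected by VA bits 47:12, so one base-plus-offset
 * computation yields its physical address. */
static inline uint64_t asap_leaf_pte_addr(const asap_bases_t *b, uint64_t va)
{
    uint64_t idx = (va >> 12) & ((1ULL << 36) - 1);  /* VA bits 47:12 */
    return b->l4_base + idx * sizeof(uint64_t);      /* 8-byte entries */
}

/* Same idea one level up: the PD entry is selected by VA bits 47:21. */
static inline uint64_t asap_pd_entry_addr(const asap_bases_t *b, uint64_t va)
{
    uint64_t idx = (va >> 21) & ((1ULL << 27) - 1);  /* VA bits 47:21 */
    return b->l3_base + idx * sizeof(uint64_t);
}

On a TLB miss, a walker extended along these lines could issue prefetches to both computed addresses in parallel with the conventional walk, consuming the prefetched entries only once the walk confirms they were correctly predicted.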
