Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on such multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that page-table placement is becoming crucial to overall performance. We propose Mitosis to mitigate NUMA effects on page-table walks by transparently replicating and migrating page-tables across sockets without application changes. This reduces the frequency of accesses to remote NUMA nodes when performing page-table walks. Mitosis uses two components: (i) a mechanism to efficiently enable and (ii) policies to effectively control -- page-table replication and migration. We implement Mitosis in Linux and evaluate its benefits on real hardware. Mitosis improves performance for large-scale multi-socket workloads by up to 1.34x by replicating page-tables across sockets. Moreover, it improves performance by up to 3.24x in cases when the OS migrates a process across sockets by enabling cross-socket page-table migration.

[1]  Natalie D. Enright Jerger,et al.  Enabling interposer-based disintegration of multi-core processors , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  K. Gopinath,et al.  HawkEye: Efficient Fine-grained OS Support for Huge Pages , 2019, ASPLOS.

[3]  David A. Patterson,et al.  The GAP Benchmark Suite , 2015, ArXiv.

[4]  Natalie D. Enright Jerger,et al.  Modular Routing Design for Chiplet-Based Systems , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[5]  Abhishek Bhattacharjee,et al.  Translation-Triggered Prefetching , 2017, ASPLOS.

[6]  Aamer Jaleel,et al.  CoLT: Coalesced Large-Reach TLBs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[7]  Yang Zhang,et al.  Corey: An Operating System for Many Cores , 2008, OSDI.

[8]  Xin Tong,et al.  Prediction-based superpage-friendly TLB designs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[9]  Michael M. Swift,et al.  Efficient virtual memory for big memory servers , 2013, ISCA.

[10]  Alan L. Cox,et al.  SpecTLB: A mechanism for speculative address translation , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[11]  Timothy Roscoe,et al.  Shoal: Smart Allocation and Replication of Memory For Parallel Programs , 2015, USENIX Annual Technical Conference.

[12]  Subramanian S. Iyer,et al.  Heterogeneous Integration for Performance and Scaling , 2016, IEEE Transactions on Components, Packaging and Manufacturing Technology.

[13]  Osman S. Unsal,et al.  Range Translations for Fast Virtual Memory , 2016, IEEE Micro.

[14]  Willy Zwaenepoel,et al.  The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS , 2018, USENIX Annual Technical Conference.

[15]  Andrew Siegel,et al.  XSBENCH - THE DEVELOPMENT AND VERIFICATION OF A PERFORMANCE ABSTRACTION FOR MONTE CARLO REACTOR ANALYSIS , 2014 .

[16]  Gabriel H. Loh,et al.  Increasing TLB reach by exploiting clustering in page translations , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[17]  Alan L. Cox,et al.  Practical, transparent operating system support for superpages , 2002, OPSR.

[18]  Adrian Schüpbach,et al.  The multikernel: a new OS architecture for scalable multicore systems , 2009, SOSP '09.

[19]  Vivien Quéma,et al.  Large Pages May Be Harmful on NUMA Systems , 2014, USENIX Annual Technical Conference.

[20]  Youngjin Kwon,et al.  Coordinated and Efficient Huge Page Management with Ingens , 2016, OSDI.

[21]  Tianhao Zhang,et al.  Do-it-yourself virtual memory translation , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[22]  Osman S. Unsal,et al.  Redundant Memory Mappings for fast access to large memories , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[23]  Rami G. Melhem,et al.  Supporting superpages in non-contiguous physical memory , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[24]  Divyakant Agrawal,et al.  Albatross: Lightweight Elasticity in Shared Storage Databases for the Cloud using Live Data Migration , 2011, Proc. VLDB Endow..

[25]  Alan L. Cox,et al.  Translation caching: skip, don't walk (the page table) , 2010, ISCA.

[26]  Leigh Stoller,et al.  Increasing TLB reach using superpages backed by shadow memory , 1998, ISCA.

[27]  Abhishek Bhattacharjee,et al.  Large-reach memory management unit caches , 2013, MICRO.

[28]  Ján Veselý,et al.  Large pages and lightweight memory management in virtualized environments: Can you have it both ways? , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[29]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[30]  Narayanan Ganapathy,et al.  General Purpose Operating System Support for Multiple Page Sizes , 1998, USENIX Annual Technical Conference.

[31]  Christian Bienia,et al.  PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors , 2009 .

[32]  Margaret Martonosi,et al.  Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[33]  Michael M. Swift,et al.  Agile Paging: Exceeding the Best of Nested and Shadow Paging , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[34]  Vivien Quéma,et al.  Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[35]  Michael M. Swift,et al.  Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[36]  Nikolaos Hardavellas,et al.  Galaxy: a high-performance energy-efficient multi-chip architecture using photonic interconnects , 2014, ICS '14.

[37]  Mark D. Hill,et al.  Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.

[38]  Michael M. Swift,et al.  Devirtualizing Memory in Heterogeneous Systems , 2018, ASPLOS.

[39]  Michael M. Swift,et al.  Agile Paging for Efficient Memory Virtualization , 2017, IEEE Micro.

[40]  Gu-Yeon Wei,et al.  Thread motion: fine-grained power management for multi-core systems , 2009, ISCA '09.

[41]  Anand Sivasubramaniam,et al.  Going the distance for TLB prefetching: an application-driven study , 2002, ISCA.

[42]  Vivien Quéma,et al.  The Linux scheduler: a decade of wasted cores , 2016, EuroSys.

[43]  M. Frans Kaashoek,et al.  RadixVM: scalable address spaces for multithreaded applications , 2013, EuroSys '13.

[44]  Per Stenström,et al.  Recency-based TLB preloading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[45]  Zhen Fang,et al.  Reevaluating online superpage promotion with hardware support , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[46]  Marcos K. Aguilera,et al.  Black-box Concurrent Data Structures for NUMA Architectures , 2017, ASPLOS.

[47]  K. Gopinath,et al.  Making Huge Pages Actually Useful , 2018, ASPLOS.

[48]  Osman S. Unsal,et al.  Performance analysis of the memory management unit under scale-out workloads , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).

[49]  Margaret Martonosi,et al.  TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs , 2013, TACO.