Making Huge Pages Actually Useful

Virtual-to-physical address translation overhead, a major performance bottleneck for modern workloads, can be effectively alleviated with huge pages. However, each huge page must be backed by a physically contiguous region of memory. As a result, memory fragmentation has kept operating systems from using huge pages well, even though hardware has supported them for nearly two decades. This paper presents a comprehensive study of how fragmentation interacts with huge pages in the Linux kernel. We observe that when huge pages are used, problems such as high CPU utilization and latency spikes arise from unnecessary work (e.g., useless page migration) performed by memory-management subsystems, which in turn stems from their poor handling of unmovable (i.e., kernel) pages. This behavior is even more harmful in virtualized systems, where the unnecessary work may be performed in both the guest and the host OS. We present Illuminator, an efficient memory manager that gives various subsystems, such as the page allocator, the ability to track all unmovable pages. This lets those subsystems make informed decisions and eliminate unnecessary work, which in turn makes huge page allocation cost-effective. Illuminator reduces the cost of compaction by up to 99%, improves application performance by up to 2.3x, and reduces the maximum latency of a MySQL database server by 30x. Importantly, this work shows that a simple solution can be effective against long-standing huge-page-related problems.
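To make the core idea concrete, the toy model below sketches why tracking unmovable pages removes useless migration. It is not Illuminator's implementation: the names `Frame`, `PhysMem`, and `plan_compaction`, the per-block `unmovable` counter, and the 512-frame block size (a 2 MiB huge page built from 4 KiB base pages, as on x86-64) are all illustrative assumptions. A compactor that cannot see unmovable pages may migrate every movable page out of a block that still holds a kernel page; that work is wasted, because the block can never become a free huge page. A compactor that consults the counter simply skips such blocks.

```python
from enum import Enum

FRAMES_PER_BLOCK = 512  # assumption: 2 MiB huge page / 4 KiB base page


class Frame(Enum):
    FREE = 0
    MOVABLE = 1    # e.g. user pages: can be migrated elsewhere
    UNMOVABLE = 2  # e.g. kernel pages: pinned in place


class PhysMem:
    """Toy model of physical memory split into huge-page-sized blocks."""

    def __init__(self, n_blocks):
        self.frames = [Frame.FREE] * (n_blocks * FRAMES_PER_BLOCK)
        # Hypothetical per-block counter, updated on every state change,
        # so "does this block hold unmovable pages?" is an O(1) query.
        self.unmovable = [0] * n_blocks

    def set_frame(self, pfn, kind):
        blk = pfn // FRAMES_PER_BLOCK
        if self.frames[pfn] is Frame.UNMOVABLE:
            self.unmovable[blk] -= 1
        if kind is Frame.UNMOVABLE:
            self.unmovable[blk] += 1
        self.frames[pfn] = kind

    def block(self, b):
        return self.frames[b * FRAMES_PER_BLOCK:(b + 1) * FRAMES_PER_BLOCK]


def plan_compaction(mem, use_tracking):
    """Return (pages_migrated, huge_pages_freed) for one compaction pass.

    Simplification: migrated pages are assumed to land in free memory
    outside the modelled blocks, so only the work and outcome are counted.
    """
    migrated = freed = 0
    for b in range(len(mem.unmovable)):
        blk = mem.block(b)
        if all(f is Frame.FREE for f in blk):
            continue  # already a free huge page; nothing to do
        if use_tracking and mem.unmovable[b] > 0:
            continue  # informed skip: this block can never become fully free
        migrated += sum(f is Frame.MOVABLE for f in blk)
        if mem.unmovable[b] == 0:
            freed += 1  # every movable page moved out, so the block is free
    return migrated, freed


# Usage: block 0 mixes movable pages with one kernel page; block 1 is
# half movable with no kernel pages; blocks 2 and 3 are free.
demo = PhysMem(n_blocks=4)
for pfn in range(256):
    demo.set_frame(pfn, Frame.MOVABLE)
demo.set_frame(300, Frame.UNMOVABLE)
for pfn in range(512, 768):
    demo.set_frame(pfn, Frame.MOVABLE)

print(plan_compaction(demo, use_tracking=False))  # (512, 1)
print(plan_compaction(demo, use_tracking=True))   # (256, 1)
```

With tracking, the pass does half the migration work yet frees the same single huge page. A per-block counter is one simple way to make the skip decision cheap; the trade-off is that it must be maintained on every page-state change.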
