Adaptive huge-page subrelease for non-moving memory allocators in warehouse-scale computers

Modern C++ server workloads rely on 2 MB huge pages to improve memory system performance via higher TLB hit rates. Huge pages have traditionally been supported at the kernel level, but recent work has shown that user-level, huge-page-aware memory allocators can achieve higher huge page coverage and thus better performance. These allocators face a trade-off: (1) allocate memory from the operating system (OS) at the granularity of a huge page, which achieves high performance but potentially wastes memory due to fragmentation, or (2) limit fragmentation by breaking up huge pages into smaller 4 KB pages and returning them to the OS, which reduces performance due to lower huge page coverage. For example, the state-of-the-art TCMalloc allocator handles this trade-off by releasing memory to the OS at a configurable release rate, breaking up huge pages as necessary. This approach balances performance and fragmentation well on machines running a single workload. When multiple applications share a machine, however, the reduction in memory usage benefits overall performance only if another workload uses the released memory. In warehouse-scale computers, when an application releases memory and then quickly reacquires the same amount or more, and no other application uses that memory in the meantime, the release lowers huge page coverage without any system-wide benefit. We introduce a metric, realized fragmentation, to capture this effect. We then present an adaptive release policy that dynamically determines when to break up huge pages and return them to the OS so as to optimize system-wide performance. We built this policy into TCMalloc and deployed it fleet-wide in our data centers, leading to an estimated 1% fleet-wide throughput improvement at negligible memory overhead.
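To make the release-and-reacquire problem concrete, the following is a minimal sketch of one way a release decision could account for recent demand. It is not the policy described in the paper, and it does not use TCMalloc's API: the function names, the fixed lookback window, and the choice of recent peak demand as a predictor of near-term reacquisition are all assumptions made for illustration. The idea is to hold on to memory up to the recent demand peak and release only what lies above it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; names and the windowed-peak heuristic are
// illustrative assumptions, not the paper's policy or TCMalloc's API.
//
// demand[t] is the number of pages the application had in use at step t.

// Returns the highest demand among the last `window` samples ending at `now`.
std::size_t RecentPeakDemand(const std::vector<std::size_t>& demand,
                             std::size_t now, std::size_t window) {
  assert(window > 0 && now < demand.size());
  // Clamp the window start at index 0 when the history is short.
  const std::size_t begin = (now + 1 >= window) ? now + 1 - window : 0;
  return *std::max_element(
      demand.begin() + static_cast<std::ptrdiff_t>(begin),
      demand.begin() + static_cast<std::ptrdiff_t>(now + 1));
}

// Returns how many pages could be released at step `now` if the allocator
// keeps enough memory to cover the recent demand peak. `held` is the number
// of pages currently backed by the OS (in use plus free).
std::size_t PagesToRelease(const std::vector<std::size_t>& demand,
                           std::size_t held, std::size_t now,
                           std::size_t window) {
  const std::size_t peak = RecentPeakDemand(demand, now, window);
  return held > peak ? held - peak : 0;
}
```

With a demand history of `{100, 400, 150, 120}` and 500 pages held, a three-sample window still sees the spike to 400 and permits releasing only 100 pages, whereas a two-sample window has forgotten the spike and permits releasing 350. A longer window therefore trades memory for huge page coverage, which is the tension the abstract describes.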
