Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator

Memory allocation represents a significant compute cost at warehouse scale, and optimizing it can yield considerable cost savings. One classical approach is to make the allocator itself more efficient, minimizing the cycles spent in allocator code. However, memory allocation decisions also affect overall application performance through data placement, which offers opportunities to improve fleetwide productivity by completing more units of application work with fewer hardware resources. Here, we focus on hugepage coverage. We present TEMERAIRE, a hugepage-aware enhancement of TCMALLOC that reduces CPU overheads in the application’s code. We discuss the design and implementation of TEMERAIRE, including strategies for hugepage-aware memory layouts that maximize hugepage coverage and minimize fragmentation overheads. We present application studies for 8 applications, improving requests-per-second (RPS) by 7.7% and reducing RAM usage by 2.4%. We present the results of a 1% experiment at fleet scale as well as the longitudinal rollout in Google’s warehouse-scale computers. This yielded 6% fewer TLB miss stalls and a 26% reduction in memory wasted due to fragmentation. We conclude with a discussion of additional techniques for improving the allocator development process and potential optimization strategies for future memory allocators.
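As a rough illustration of the two quantities the abstract refers to, hugepage coverage and memory wasted to fragmentation, the sketch below computes both over a toy heap description. It is not TEMERAIRE's implementation: the `HugepageRegion` type, the function names, the 2 MiB hugepage and 4 KiB base page sizes, and the simplification that free pages in broken-up regions have already been returned to the OS are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed sizes (typical on x86-64; not stated in the abstract).
HUGEPAGE_BYTES = 2 * 1024 * 1024
PAGE_BYTES = 4 * 1024
PAGES_PER_HUGEPAGE = HUGEPAGE_BYTES // PAGE_BYTES  # 512


@dataclass
class HugepageRegion:
    """Hypothetical record for one hugepage-sized, hugepage-aligned region."""
    used_pages: int  # base pages that hold live allocations
    intact: bool     # True if the region is still backed by a single hugepage


def hugepage_coverage(regions):
    """Fraction of in-use memory that is backed by intact hugepages."""
    total_used = sum(r.used_pages for r in regions)
    if total_used == 0:
        return 0.0
    covered = sum(r.used_pages for r in regions if r.intact)
    return covered / total_used


def fragmentation_bytes(regions):
    """Bytes held but unused inside intact hugepages.

    Simplification: free pages in broken-up regions are assumed to have
    been released to the OS, so they are not counted as waste here.
    """
    return sum(
        (PAGES_PER_HUGEPAGE - r.used_pages) * PAGE_BYTES
        for r in regions
        if r.intact
    )


if __name__ == "__main__":
    heap = [
        HugepageRegion(used_pages=512, intact=True),   # full hugepage
        HugepageRegion(used_pages=256, intact=True),   # half-full hugepage
        HugepageRegion(used_pages=256, intact=False),  # broken up into base pages
    ]
    print(hugepage_coverage(heap))
    print(fragmentation_bytes(heap))
```

The example makes the tension visible: keeping a sparsely used hugepage intact raises coverage but also raises the bytes counted as fragmentation, while breaking it up does the reverse.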
