Learning-based Memory Allocation for C++ Server Workloads

Modern C++ servers have memory footprints that vary widely over time, causing persistent heap fragmentation of up to 2x from long-lived objects allocated during peak memory usage. This fragmentation is exacerbated by the use of huge (2MB) pages, a requirement for high performance on large heap sizes. Reducing fragmentation automatically is challenging because C++ memory managers cannot move objects. This paper presents a new approach to huge page fragmentation. It combines modern machine learning techniques with a novel memory manager (LLAMA) that manages the heap based on object lifetimes and huge pages (divided into blocks and lines). A neural network-based language model predicts lifetime classes using symbolized calling contexts. The model learns context-sensitive per-allocation site lifetimes from previous runs, generalizes over different binary versions, and extrapolates from samples to unobserved calling contexts. Instead of size classes, LLAMA's heap is organized by lifetime classes that are dynamically adjusted based on observed behavior at a block granularity. LLAMA reduces memory fragmentation by up to 78% while only using huge pages on several production servers. We address ML-specific questions such as tolerating mispredictions and amortizing expensive predictions across application execution. Although our results focus on memory allocation, the questions we identify apply to other system-level problems with strict latency and resource requirements where machine learning could be applied.
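
To make the organization concrete, the following C++ sketch illustrates the general idea described above: a prediction cache keyed by a hash of the symbolized calling context amortizes expensive model queries, and each allocation is routed to a block dedicated to its predicted lifetime class rather than to a size class. This is a minimal sketch under stated assumptions, not LLAMA's implementation; all names (LifetimeClass, PredictionCache, HugePageBlock, LifetimeAwareAllocator) and constants such as the block size are hypothetical.

// Minimal sketch of a lifetime-class-organized allocation path.
// NOT the paper's implementation: class names, block size, and the number of
// lifetime classes below are hypothetical illustrations.
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Coarse, hypothetical lifetime classes (e.g. short-lived vs. long-lived).
enum class LifetimeClass : uint8_t { kShort = 0, kMedium, kLong, kImmortal, kCount };

// Caches predictions keyed by a hash of the symbolized calling context, so an
// expensive model query runs at most once per context and is amortized across
// many allocations from that context.
class PredictionCache {
 public:
  LifetimeClass Lookup(uint64_t context_hash) {
    auto it = cache_.find(context_hash);
    if (it != cache_.end()) return it->second;
    LifetimeClass predicted = QueryModel(context_hash);  // expensive, done once
    cache_.emplace(context_hash, predicted);
    return predicted;
  }

 private:
  // Stand-in for the learned model; a real system would run inference on the
  // symbolized stack trace captured at the allocation site.
  LifetimeClass QueryModel(uint64_t context_hash) {
    return static_cast<LifetimeClass>(
        context_hash % static_cast<uint64_t>(LifetimeClass::kCount));
  }
  std::unordered_map<uint64_t, LifetimeClass> cache_;
};

// A block carved out of a huge page region; every block holds objects of a
// single lifetime class, so a few long-lived objects never pin space meant
// for short-lived ones.
class HugePageBlock {
 public:
  explicit HugePageBlock(LifetimeClass lc) : lifetime_class_(lc), storage_(kBlockSize) {}

  LifetimeClass lifetime_class() const { return lifetime_class_; }

  // Bump-pointer allocation within the block; returns nullptr when full.
  // Oversized requests are not handled in this sketch.
  void* TryAllocate(size_t bytes) {
    bytes = (bytes + 15) & ~size_t{15};  // 16-byte alignment
    if (offset_ + bytes > storage_.size()) return nullptr;
    void* p = storage_.data() + offset_;
    offset_ += bytes;
    return p;
  }

 private:
  static constexpr size_t kBlockSize = 128 * 1024;  // hypothetical block size
  LifetimeClass lifetime_class_;
  size_t offset_ = 0;
  std::vector<std::byte> storage_;
};

// Allocator front end: route each allocation to a block of the predicted
// lifetime class instead of a size class.
class LifetimeAwareAllocator {
 public:
  void* Allocate(size_t bytes, uint64_t context_hash) {
    LifetimeClass lc = predictions_.Lookup(context_hash);
    auto& blocks = blocks_per_class_[static_cast<size_t>(lc)];
    if (!blocks.empty()) {
      if (void* p = blocks.back().TryAllocate(bytes)) return p;
    }
    blocks.emplace_back(lc);  // start a fresh block for this lifetime class
    return blocks.back().TryAllocate(bytes);
  }

 private:
  PredictionCache predictions_;
  std::array<std::vector<HugePageBlock>, static_cast<size_t>(LifetimeClass::kCount)>
      blocks_per_class_;
};

Keeping each block homogeneous in lifetime class is what would let an allocator return whole blocks, and eventually whole huge pages, to the operating system once their objects die, instead of leaving a handful of long-lived objects pinning an entire 2MB page; caching predictions per calling context keeps model inference off the allocation fast path.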
