HALO: post-link heap-layout optimisation

Today, general-purpose memory allocators dominate the landscape of dynamic memory management. While these solutions can provide reasonably good behaviour across a wide range of workloads, it is an unfortunate reality that their behaviour for any particular workload can be highly suboptimal. By catering primarily to average and worst-case usage patterns, these allocators deny programs the advantages of domain-specific optimisations, and thus may inadvertently place data in a manner that hinders performance, generating unnecessary cache misses and load stalls. To help alleviate these issues, we propose HALO: a post-link profile-guided optimisation tool that can improve the layout of heap data to reduce cache misses automatically. Profiling the target binary to understand how allocations made in different contexts are related, we specialise memory-management routines to allocate groups of related objects from separate pools to increase their spatial locality. Unlike other solutions of its kind, HALO employs novel grouping and identification algorithms which allow it to create tight-knit allocation groups using the entire call stack and to identify these efficiently at runtime. Evaluation of HALO on contemporary out-of-order hardware demonstrates speedups of up to 28% over jemalloc, out-performing a state-of-the-art data placement technique from the literature.
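The core idea of routing allocations to separate pools according to their calling context can be illustrated with a toy model. The sketch below is not HALO's grouping or identification algorithm: the context-to-group mapping is written by hand where a real tool would derive it from a profile, addresses are plain integers, and all names (`BumpPool`, `GroupedAllocator`, `POOL_SIZE`, the example call stacks) are hypothetical.

```python
# Toy model of context-grouped pool allocation (illustrative only).

class BumpPool:
    """A contiguous region that hands out consecutive addresses."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.next = base

    def alloc(self, nbytes):
        if self.next + nbytes > self.base + self.size:
            raise MemoryError("pool exhausted")
        addr = self.next
        self.next += nbytes
        return addr


class GroupedAllocator:
    """Routes each allocation to a pool chosen by its calling context."""

    POOL_SIZE = 1 << 20  # hypothetical: 1 MiB reserved per pool

    def __init__(self, context_to_group):
        # Maps full call-stack tuples to group ids. This stands in for
        # the result of an offline profiling step (assumed, hand-written).
        self.context_to_group = dict(context_to_group)
        self.pools = {}
        self.next_base = 0

    def _pool_for(self, group):
        # Pools are created lazily, each in its own address range.
        if group not in self.pools:
            self.pools[group] = BumpPool(self.next_base, self.POOL_SIZE)
            self.next_base += self.POOL_SIZE
        return self.pools[group]

    def malloc(self, nbytes, context):
        # Contexts absent from the profile share a default pool.
        group = self.context_to_group.get(tuple(context), "default")
        return self._pool_for(group).alloc(nbytes)


# The same allocation site (new_node) lands in different groups
# depending on the rest of the call stack.
profile = {
    ("main", "build_tree", "new_node"): "tree",
    ("main", "build_tree", "new_edge"): "tree",
    ("main", "log_event", "new_node"): "log",
}
heap = GroupedAllocator(profile)
a = heap.malloc(32, ["main", "build_tree", "new_node"])
b = heap.malloc(64, ["main", "log_event", "new_node"])
c = heap.malloc(32, ["main", "build_tree", "new_edge"])
```

Although the `log` allocation is requested between the two `tree` allocations, `a` and `c` end up adjacent in the `tree` pool while `b` is placed in a different address range. A single sequential allocator would instead have interleaved all three.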
