NextGen-Malloc: Giving Memory Allocator Its Own Room in the House

Memory allocation and management have a significant impact on the performance and energy consumption of modern applications. We observe that performance can vary by as much as 72% in some applications depending on which memory allocator is used. Many current allocators are multi-threaded to support concurrent allocation requests from different threads. However, this multi-threading comes at the cost of maintaining complex metadata that is tightly coupled and intertwined with user data. When memory management functions and user programs run on the same core, the metadata used by the management functions may pollute the processor caches and other resources. In this paper, we make a case for offloading memory allocation (and other similar management functions) from the main processing cores to other processing units in order to boost performance, reduce energy consumption, and customize services to specific applications or application domains. To offload these multi-threaded, fine-granularity functions, we propose decoupling their metadata from the rest of the application data, which reduces the overhead of inter-thread metadata synchronization. We draw attention to the following key questions in realizing this opportunity: (a) What are the tradeoffs and challenges in offloading memory allocation to a dedicated core? (b) Should we use general-purpose cores or special-purpose cores for executing critical system management functions? (c) Can this methodology be applied to heterogeneous systems (e.g., with GPUs or accelerators) and to other service functions as well?
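To make the idea of offloading with decoupled metadata concrete, here is a minimal sketch in Python. It is not the paper's design: the class name `OffloadedAllocator`, the single fixed block size, and the queue-based request protocol are all assumptions made for illustration. A single service thread stands in for the dedicated core and is the only code that touches the allocator metadata, so application threads never synchronize on it directly. Python threads do not map to separate cores, so the sketch shows only the ownership structure, not any performance effect.

```python
import queue
import threading


class OffloadedAllocator:
    """Toy allocator whose metadata is touched only by one service thread.

    Hypothetical illustration: one fixed size class, offsets instead of
    pointers, and a request queue standing in for cross-core messaging.
    """

    def __init__(self, arena_size=1 << 16, block_size=64):
        self.arena = bytearray(arena_size)  # user data lives here
        self.block_size = block_size
        # Metadata is kept apart from the arena: a free list of block
        # offsets and a set of live offsets, private to the service thread.
        self._free = list(range(0, arena_size, block_size))
        self._live = set()
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Stands in for the dedicated allocator core. It is the only
        # reader and writer of _free and _live, so no lock guards them.
        while True:
            op, arg, reply = self._requests.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "malloc":
                if arg > self.block_size or not self._free:
                    reply.put(None)  # request too large or arena exhausted
                    continue
                offset = self._free.pop()
                self._live.add(offset)
                reply.put(offset)
            elif op == "free":
                if arg in self._live:
                    self._live.remove(arg)
                    self._free.append(arg)
                    reply.put(True)
                else:
                    reply.put(False)  # double free or unknown offset

    def _call(self, op, arg=None):
        # Each request carries its own one-slot reply queue.
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, arg, reply))
        return reply.get()

    def malloc(self, size):
        """Return an offset into the arena, or None on failure."""
        return self._call("malloc", size)

    def free(self, offset):
        """Return True if the block was live and is now released."""
        return self._call("free", offset)

    def stop(self):
        self._call("stop")
        self._thread.join()
```

Application threads call `malloc` and `free` as usual and read or write `arena` at the returned offset. Every request in this sketch blocks until the service thread replies, which is the kind of communication cost that question (a) asks about.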
