Mallacc: Accelerating Memory Allocation

Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc differs markedly from that of traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed with other application code, such an accelerator must be optimized for latency rather than throughput, and its area overhead must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 µm² of silicon area, less than 0.006% of a typical high-performance processor core.
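
To make the three operations concrete, the following is a minimal software sketch of an allocator fast path, assuming a simple design with per-size-class free lists. It is not Mallacc itself or the code of any particular allocator: the names (size_class, fast_malloc, fast_free, should_sample) and the constants (16-byte granules, 16 classes, a 4096-byte sampling interval) are hypothetical choices made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameters, chosen for illustration only. */
#define CLASS_GRANULE   16    /* size classes of 16, 32, ..., 256 bytes */
#define NUM_CLASSES     16
#define SAMPLE_INTERVAL 4096  /* sample about once per 4096 bytes requested */

typedef struct FreeBlock {
    struct FreeBlock *next;   /* link stored inside the free block itself */
} FreeBlock;

static FreeBlock *free_lists[NUM_CLASSES];  /* one LIFO list per size class */
static size_t bytes_until_sample = SAMPLE_INTERVAL;
static unsigned long samples_taken;

/* Operation 1: size class computation.
 * Rounds the request up to the next granule; returns -1 if it is too
 * large for the fast path. */
static int size_class(size_t size) {
    if (size == 0) size = 1;
    if (size > (size_t)NUM_CLASSES * CLASS_GRANULE) return -1;
    return (int)((size + CLASS_GRANULE - 1) / CLASS_GRANULE) - 1;
}

/* Operation 3: sampling of memory usage.
 * Counts down a byte budget; returns 1 when this allocation is sampled. */
static int should_sample(size_t size) {
    if (size >= bytes_until_sample) {
        bytes_until_sample = SAMPLE_INTERVAL;
        samples_taken++;
        return 1;
    }
    bytes_until_sample -= size;
    return 0;
}

/* Pushes a block back onto the free list of its size class. */
static void fast_free(void *p, int cls) {
    assert(cls >= 0 && cls < NUM_CLASSES);
    FreeBlock *b = p;
    b->next = free_lists[cls];
    free_lists[cls] = b;
}

/* Fast path of an allocation request.
 * Returns NULL when a real allocator would fall back to its slow path. */
static void *fast_malloc(size_t size) {
    int cls = size_class(size);
    if (cls < 0) return NULL;
    should_sample((size_t)(cls + 1) * CLASS_GRANULE);

    /* Operation 2: retrieval of a free memory block (pop the list head). */
    FreeBlock *head = free_lists[cls];
    if (head == NULL) return NULL;
    free_lists[cls] = head->next;
    return head;
}
```

Every step here is a handful of dependent loads and arithmetic instructions on the critical path of each request, which is the kind of short, frequent work the abstract argues calls for a latency-oriented rather than throughput-oriented accelerator.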
