SoaAlloc: A Lock-free Hierarchical Bitmap-based Object Allocator for GPUs

Designing dynamic memory allocators for GPUs is challenging for two reasons: applications can issue allocation requests in a highly parallel fashion, and both memory access and data layout must be optimized to achieve good memory bandwidth utilization. Despite recent advances in GPU computing, current memory allocators for SIMD architectures remain ill-suited to structured data because they do not incorporate well-known best practices for optimizing memory access. We therefore developed SoaAlloc, a new dynamic object allocator for GPUs. Besides delivering competitive raw (de)allocation performance, SoaAlloc improves the usage of allocated memory with a Structure of Arrays (SOA) data layout and achieves low memory fragmentation by managing free and allocated memory blocks with lock-free, hierarchical bitmaps. In our benchmarks, the SOA layout alone yields a 2x speedup of application code over state-of-the-art allocators. Furthermore, SoaAlloc is the first GPU object allocator that provides a do-all operation, an important recurring pattern in high-performance code in which parallelism is expressed over a set of objects.
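To make the idea of a lock-free, hierarchical bitmap more concrete, the following is a minimal host-side C++ sketch. It is not SoaAlloc's implementation. The class name `TwoLevelBitmap`, the helper `lowest_bit_index`, the fixed two-level depth, the 4096-slot capacity, and the choice to treat the top level as a hint are all assumptions made for illustration. A GPU version would presumably use device atomics rather than `std::atomic`.

```cpp
#include <atomic>
#include <cstdint>

// Index of the lowest set bit; the caller guarantees x != 0.
inline int lowest_bit_index(std::uint64_t x) {
  int i = 0;
  while ((x & 1ULL) == 0) { x >>= 1; ++i; }
  return i;
}

// Hypothetical two-level bitmap over 64 * 64 = 4096 slots (illustrative only).
// A set bit in leaf_[w] means "slot w * 64 + b is free".
// A set bit in top_ means "leaf word w probably has a free slot" (a hint).
class TwoLevelBitmap {
 public:
  static constexpr int kWords = 64;

  TwoLevelBitmap() : top_(~0ULL) {
    for (auto& w : leaf_) w.store(~0ULL);  // every slot starts out free
  }

  // Claims a free slot and returns its index, or -1 if none was found.
  int allocate() {
    std::uint64_t hint = top_.load();
    while (hint != 0) {
      const int w = lowest_bit_index(hint);
      hint &= hint - 1;  // drop this candidate from the local snapshot
      std::uint64_t word = leaf_[w].load();
      while (word != 0) {
        const std::uint64_t bit = word & (~word + 1);  // lowest set bit
        const std::uint64_t prev = leaf_[w].fetch_and(~bit);
        if (prev & bit) {     // this thread cleared the bit: slot claimed
          if (prev == bit) {  // the leaf word just became empty
            top_.fetch_and(~(1ULL << w));
            // Re-check: a concurrent deallocate may have refilled the word.
            if (leaf_[w].load() != 0) top_.fetch_or(1ULL << w);
          }
          return w * 64 + lowest_bit_index(bit);
        }
        word = prev & ~bit;  // another thread won the race; try the next bit
      }
    }
    return -1;
  }

  // Marks a previously allocated slot as free again.
  void deallocate(int slot) {
    const int w = slot / 64;
    const std::uint64_t bit = 1ULL << (slot % 64);
    const std::uint64_t prev = leaf_[w].fetch_or(bit);
    if (prev == 0) top_.fetch_or(1ULL << w);  // word went from empty to non-empty
  }

  bool is_allocated(int slot) const {
    return (leaf_[slot / 64].load() & (1ULL << (slot % 64))) == 0;
  }

  // Sequential stand-in for a do-all: calls f(slot) for every allocated slot.
  template <typename F>
  void for_each_allocated(F f) const {
    for (int w = 0; w < kWords; ++w) {
      std::uint64_t used = ~leaf_[w].load();
      while (used != 0) {
        f(w * 64 + lowest_bit_index(used));
        used &= used - 1;
      }
    }
  }

 private:
  std::atomic<std::uint64_t> top_;
  std::atomic<std::uint64_t> leaf_[kWords];
};
```

In this sketch a thread claims a slot with a single atomic `fetch_and` and never takes a lock. The top-level word lets a search skip leaf words that are full, which is the main benefit of the hierarchy. Because the two levels are updated by separate atomic operations, the top level can briefly disagree with the leaves, so `allocate` tolerates a stale hint by moving on to the next candidate word. The `for_each_allocated` loop runs sequentially here; a do-all on a GPU would instead distribute the allocated slots across threads.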
