ScatterAlloc: Massively parallel dynamic memory allocation for the GPU

In this paper, we analyze the special requirements of a dynamic memory allocator designed for massively parallel architectures such as Graphics Processing Units (GPUs). We show that traditional strategies, which work well on CPUs, are not well suited for use on GPUs, and we present the detailed design of ScatterAlloc, which can efficiently handle hundreds of requests in parallel. Our allocator greatly reduces collisions and congestion by scattering memory requests based on hashing. We analyze ScatterAlloc in terms of allocation speed, data access time, and fragmentation, and compare it to current state-of-the-art allocators, including the one provided with the NVIDIA CUDA toolkit. Our results show that ScatterAlloc clearly outperforms these other approaches, yielding speed-ups of between 10 and 100.
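
The idea of scattering requests by hashing can be illustrated with a small host-side sketch. This is not the ScatterAlloc implementation: the function names, the multiplicative hash constant, and the page count are assumptions chosen for illustration only. The sketch compares how many simultaneous requests hit the busiest page when every thread starts its search at page 0 and when each thread's starting page is derived from a hash of its id.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hash: maps a thread id to a starting page.
// The constant is a common multiplicative-hashing choice, not one taken from the paper.
inline std::size_t scatter_page(std::uint32_t thread_id, std::size_t num_pages) {
    std::uint32_t h = thread_id * 2654435761u;  // wraps modulo 2^32 by design
    return static_cast<std::size_t>(h) % num_pages;
}

// Returns the number of requests that land on the most heavily loaded page.
inline std::size_t max_load(const std::vector<std::size_t>& chosen_pages,
                            std::size_t num_pages) {
    std::vector<std::size_t> load(num_pages, 0);
    std::size_t worst = 0;
    for (std::size_t p : chosen_pages) {
        if (++load[p] > worst) worst = load[p];
    }
    return worst;
}

// Naive strategy: every request starts at page 0, so all of them collide there.
inline std::size_t naive_max_load(std::size_t n_requests, std::size_t num_pages) {
    std::vector<std::size_t> pages(n_requests, 0);
    return max_load(pages, num_pages);
}

// Scattered strategy: each request starts at a page chosen by hashing its thread id.
inline std::size_t scattered_max_load(std::size_t n_requests, std::size_t num_pages) {
    std::vector<std::size_t> pages;
    pages.reserve(n_requests);
    for (std::uint32_t t = 0; t < n_requests; ++t) {
        pages.push_back(scatter_page(t, num_pages));
    }
    return max_load(pages, num_pages);
}
```

With 256 requests and 64 pages, the naive strategy sends all 256 requests to one page, while the hashed strategy spreads them so that no page receives more than a handful. On a GPU, the per-page load roughly corresponds to how many threads contend on the same atomic operations, which is the congestion that scattering is meant to reduce.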
