Reasoning About Foreign Function Interfaces Without Modelling the Foreign Language

Object-oriented programming has long been regarded as too inefficient for SIMD high-performance computing, even though many important HPC applications have an inherent object structure. On SIMD accelerators, including GPUs, this is mainly due to performance problems with memory allocation and memory access: a few libraries support parallel memory allocation directly on accelerator devices, but all of them suffer from uncoalesced memory accesses. We discovered a broad class of object-oriented programs with many important real-world applications that can be implemented efficiently on massively parallel SIMD accelerators. We call this class Single-Method Multiple-Objects (SMMO), because parallelism is expressed by running a method on all objects of a type. To make fast GPU programming available to average programmers, we developed DynaSOAr, a CUDA framework for SMMO applications. DynaSOAr consists of (1) a fully parallel, lock-free, dynamic memory allocator, (2) a data layout DSL, and (3) an efficient, parallel do-all operation. DynaSOAr outperforms state-of-the-art GPU memory allocators by controlling both memory allocation and memory access. It improves the usage of allocated memory with a Structure of Arrays data layout and keeps memory fragmentation low by managing free and allocated memory blocks with lock-free, hierarchical bitmaps. Unlike other allocators, our design relies heavily on atomic operations, trading raw (de)allocation performance for better overall application performance. In our benchmarks, DynaSOAr speeds up application code by up to 3x over state-of-the-art allocators. Moreover, DynaSOAr manages heap memory more efficiently than other allocators, allowing programmers to run up to 2x larger problem sizes with the same amount of memory.
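The three ingredients named in the abstract (a lock-free allocator driven by atomic bitmap updates, a Structure of Arrays layout, and a do-all operation) can be illustrated with a small CPU-side C++ sketch. This is not DynaSOAr's API or implementation: `BodyBlock`, `allocate`, `deallocate`, `for_all` and `demo` are hypothetical names, the bitmap is a single 64-bit word rather than a hierarchy, and the do-all loop runs sequentially where a GPU version would assign one thread per slot.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch; none of these names come from DynaSOAr.
constexpr int kCapacity = 64;  // one 64-bit bitmap word covers one block

struct BodyBlock {
  // Structure of Arrays: each field is stored contiguously, so adjacent
  // slots hold the same field of adjacent objects.
  float pos[kCapacity];
  float vel[kCapacity];
  std::atomic<std::uint64_t> used{0};  // bit i set <=> slot i is allocated

  // Index of the lowest clear bit; the caller ensures w is not all ones.
  static int lowest_clear_bit(std::uint64_t w) {
    int i = 0;
    while (w & 1) { w >>= 1; ++i; }
    return i;
  }

  // Lock-free allocation: claim the lowest free slot with a CAS loop.
  // Returns the slot index, or -1 if the block is full.
  int allocate() {
    std::uint64_t old = used.load(std::memory_order_relaxed);
    while (true) {
      if (old == ~std::uint64_t{0}) return -1;
      int slot = lowest_clear_bit(old);
      std::uint64_t desired = old | (std::uint64_t{1} << slot);
      // On failure, compare_exchange_weak reloads `old` and we retry.
      if (used.compare_exchange_weak(old, desired,
                                     std::memory_order_acq_rel,
                                     std::memory_order_relaxed)) {
        return slot;
      }
    }
  }

  // Lock-free deallocation: clear the slot's bit atomically.
  void deallocate(int slot) {
    used.fetch_and(~(std::uint64_t{1} << slot), std::memory_order_acq_rel);
  }

  // Do-all: run f on every slot that was live when the call started.
  // Working from a snapshot lets f deallocate objects while iterating.
  template <typename F>
  void for_all(F f) {
    std::uint64_t live = used.load(std::memory_order_acquire);
    for (int i = 0; i < kCapacity; ++i) {
      if ((live >> i) & 1) f(*this, i);
    }
  }
};

// Allocates five objects, frees one, runs one "move" step over all live
// objects, and returns how many objects remain allocated.
int demo() {
  BodyBlock block;
  for (int i = 0; i < 5; ++i) {
    int s = block.allocate();
    assert(s == i);
    block.pos[s] = 0.0f;
    block.vel[s] = static_cast<float>(i);
  }
  block.deallocate(2);
  block.for_all([](BodyBlock& b, int i) {
    b.pos[i] += b.vel[i];                  // single-method step
    if (b.pos[i] > 3.5f) b.deallocate(i);  // object removes itself
  });
  int count = 0;
  for (std::uint64_t w = block.used.load(); w != 0; w &= w - 1) ++count;
  return count;
}
```

Calling `demo()` leaves the objects in slots 0, 1 and 3 allocated, so it returns 3. The sketch only shows why a bitmap plus atomic updates is enough to allocate without locks and why a per-field array layout keeps accesses to the same field adjacent in memory; it makes no claim about performance.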
