Memory-Efficient Object-Oriented Programming on GPUs

Object-oriented programming is often regarded as too inefficient for high-performance computing (HPC), despite the fact that many important HPC problems have an inherent object structure. Our goal is to bring efficient, object-oriented programming to massively parallel SIMD architectures, especially GPUs. In this thesis, we develop various techniques for optimizing object-oriented GPU code. Most notably, we identify the object-oriented Single-Method Multiple-Objects (SMMO) programming model. We first develop an embedded C++ Structure of Arrays (SOA) data layout DSL for SMMO applications. We then design a lock-free, dynamic memory allocator that stores allocations in SOA layout. Finally, we show how to further optimize the memory access of SMMO applications with memory defragmentation.
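To make the SMMO and SOA terms concrete, here is a minimal sketch in plain C++. It is not the thesis's DSL or allocator API. The names `BodyAos`, `BodySoa`, `move`, and `parallel_do_move` are hypothetical and chosen only for illustration. The "single method" is `move`, and it is applied to every object of one type. The loop stands in for what would be a GPU kernel with one thread per object.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures (AOS): all fields of one object are adjacent in memory.
// Shown only for contrast; the rest of the sketch uses the SOA variant.
struct BodyAos {
  float pos;
  float vel;
};

// Structure of Arrays (SOA): each field lives in its own contiguous array,
// and an object is identified by its index into those arrays.
struct BodySoa {
  std::vector<float> pos;
  std::vector<float> vel;

  explicit BodySoa(std::size_t n) : pos(n, 0.0f), vel(n, 0.0f) {}
  std::size_t size() const { return pos.size(); }
};

// The "single method": advance one object (given by its index) by one step.
inline void move(BodySoa& bodies, std::size_t id, float dt) {
  bodies.pos[id] += bodies.vel[id] * dt;
}

// SMMO-style "do-all": run the same method on every object of the type.
// Hypothetical sequential stand-in for a GPU kernel launch.
inline void parallel_do_move(BodySoa& bodies, float dt) {
  assert(bodies.pos.size() == bodies.vel.size());
  for (std::size_t id = 0; id < bodies.size(); ++id) {
    move(bodies, id, dt);
  }
}
```

With the SOA layout, consecutive object indices touch consecutive elements of the same field array. This is the access pattern that lets adjacent GPU threads coalesce their memory accesses. With `BodyAos`, the same loop would stride over the unused fields of each object.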
