Paper Information - Memory-Efficient Object-Oriented Programming on GPUs

Memory-Efficient Object-Oriented Programming on GPUs

Object-oriented programming is often regarded as too inefficient for high-performance computing (HPC), despite the fact that many important HPC problems have an inherent object structure. Our goal is to bring efficient, object-oriented programming to massively parallel SIMD architectures, especially GPUs. In this thesis, we develop various techniques for optimizing object-oriented GPU code. Most notably, we identify the object-oriented Single-Method Multiple-Objects (SMMO) programming model. We first develop an embedded C++ Structure of Arrays (SOA) data layout DSL for SMMO applications. We then design a lock-free, dynamic memory allocator that stores allocations in SOA layout. Finally, we show how to further optimize the memory access of SMMO applications with memory defragmentation.
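To make the layout idea concrete, here is a minimal sketch in plain C++ contrasting an Array of Structures (AoS) with a Structure of Arrays (SOA). It is not the thesis's DSL or allocator. The names `BodyAoS`, `BodiesSoA`, and `move_all` are hypothetical and chosen only for illustration, and the loop reflects an assumed reading of SMMO as one method applied to many objects of the same type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: the conventional object layout. All fields of one object are adjacent,
// so reading the same field of neighboring objects is a strided access.
struct BodyAoS {
  float pos_x;
  float vel_x;
  float mass;
};

// SOA: one contiguous array per field. Object i is the i-th entry of every
// array, so neighboring objects' values of a field are adjacent in memory.
// (Hypothetical illustration type, not the layout DSL described above.)
struct BodiesSoA {
  std::vector<float> pos_x;
  std::vector<float> vel_x;
  std::vector<float> mass;

  explicit BodiesSoA(std::size_t n)
      : pos_x(n, 0.0f), vel_x(n, 0.0f), mass(n, 1.0f) {}

  std::size_t size() const { return pos_x.size(); }
};

// The same "method" applied to every object. On a GPU, each loop iteration
// would typically be handled by a separate thread; here it is a serial loop.
inline void move_all(BodiesSoA& bodies, float dt) {
  assert(bodies.pos_x.size() == bodies.vel_x.size());
  for (std::size_t i = 0; i < bodies.size(); ++i) {
    bodies.pos_x[i] += bodies.vel_x[i] * dt;
  }
}
```

With this layout, `move_all` touches only the `pos_x` and `vel_x` arrays and never loads `mass`, which is the kind of memory-access benefit an SOA layout is generally chosen for on SIMD hardware.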

Matthias Springer
