Inner array inlining for structure of arrays layout

Previous work has shown how the well-studied and SIMD-friendly Structure of Arrays (SOA) data layout strategy can speed up applications in high-performance computing compared to a traditional Array of Structures (AOS) data layout. However, a standard SOA layout cannot handle structures with inner arrays; such structures appear frequently in graph-based applications and object-oriented designs with associations of high multiplicity. This work extends the SOA data layout to structures with array-typed fields. We present different techniques for inlining (embedding) inner arrays into an AOS or SOA layout, as well as the design and implementation of an embedded C++/CUDA DSL that lets programmers write such layouts in a notation close to standard C++. We evaluate several layout strategies with a traffic flow simulation, an important real-world application in transport planning.
