MAPS

GPUs play an increasingly important role in high-performance computing. While writing naive GPU code is straightforward, optimizing massively parallel applications requires a deep understanding of the underlying architecture, and developers must contend with complex index calculations and manual memory transfers. This article classifies the memory access patterns used in most parallel algorithms, based on Berkeley's Parallel "Dwarfs." It then proposes MAPS, a device-level memory abstraction framework that simplifies memory access on GPUs by replacing complex indexing with on-device containers and iterators. The article presents an implementation of MAPS and shows that its performance is comparable to carefully optimized implementations of real-world applications.
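To make the abstraction concrete, the sketch below contrasts manual index arithmetic with a hypothetical windowed container in the spirit of the on-device containers and iterators described above. The Window1D type and its at() accessor are illustrative assumptions for this sketch, not the framework's actual API.

// Illustrative sketch only: Window1D is a hypothetical stand-in for the kind of
// on-device container/iterator described in the abstract, not the actual MAPS API.
#include <cstdio>
#include <cuda_runtime.h>

// Manual indexing: the developer computes global offsets and clamps bounds by hand.
__global__ void blur3_manual(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int d = -1; d <= 1; ++d) {
        int j = min(max(i + d, 0), n - 1);   // explicit boundary clamping
        sum += in[j];
    }
    out[i] = sum / 3.0f;
}

// Hypothetical device-side container: hides offset arithmetic and boundary
// handling behind a small accessor.
struct Window1D {
    const float* data;
    int n;
    int center;
    __device__ float at(int offset) const {
        int j = min(max(center + offset, 0), n - 1);  // clamped neighbor access
        return data[j];
    }
};

__global__ void blur3_window(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Window1D w{in, n, i};
    out[i] = (w.at(-1) + w.at(0) + w.at(1)) / 3.0f;   // no index math in user code
}

int main() {
    const int n = 1 << 20;
    float *in, *out_a, *out_b;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out_a, n * sizeof(float));
    cudaMallocManaged(&out_b, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i % 7);

    int threads = 256, blocks = (n + threads - 1) / threads;
    blur3_manual<<<blocks, threads>>>(in, out_a, n);
    blur3_window<<<blocks, threads>>>(in, out_b, n);
    cudaDeviceSynchronize();
    printf("out_a[1] = %f, out_b[1] = %f\n", out_a[1], out_b[1]);  // the two kernels should agree

    cudaFree(in);
    cudaFree(out_a);
    cudaFree(out_b);
    return 0;
}

A full implementation along the lines the abstract describes would presumably also stage each window in shared memory and issue coalesced loads behind the same interface, which is where the reported performance benefit would come from; the sketch only illustrates how the container hides index calculations from user code.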
