Polyhedral Model Guided Automatic GPU Cache Exploitation Framework

We propose a compiler-driven approach to accelerating parallel computations on GPUs by exploiting their special-purpose caches (texture, surface, and constant caches on NVIDIA GPUs). We show that, for a class of computations, our method outperforms earlier methods that use on-chip shared memory. We provide an end-to-end solution by developing a fully automatic, sound, static framework within a state-of-the-art source-to-source polyhedral compiler (PPCG) to exploit these caches. We use the polyhedral model to decide when each kind of GPU cache is profitable. We evaluate our implementation on PolyBench/C benchmark kernels and report speedups of up to 1.5x over the memory mapping strategy currently used by the PPCG compiler. We also consider representative real-world kernels: PageRank, a DNN layer (LSTM), and solvers (Poisson and DWE-FDTD stencil). Using the special GPU caches in these programs yields speedups of up to 2.6x over a standard shared-memory-based implementation. We believe our contribution advances the automatic exploitation of the GPU cache/memory hierarchy, as it demonstrates general-purpose computing use of special GPU caches that were originally designed for image-processing applications.
