Auto-Tuning Complex Array Layouts for GPUs

Graphics Processing Units (GPUs) have shown rapid performance increases over the years. But with each new hardware generation, the constraints for programming them efficiently have changed. Programs have to be tuned for one specific piece of hardware to unleash its full potential. This is time-consuming and costly, as vendors tend to release a new generation every 18 months. It is therefore important to auto-tune GPU code to achieve GPU-specific improvements, using either static or empirical profiling to adjust parameters or to change the kernel implementation. We introduce a new approach to automatically improve memory access on GPUs. Our system generates an application-specific library which abstracts the memory access for complex arrays on the host and GPU side. This allows the code to be optimized by exchanging the memory layout without recompiling the application, as all necessary layouts are pre-compiled into the library. Our implementation is able to speed up real-world applications by up to an order of magnitude and even outperforms hand-tuned implementations.
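The idea of exchanging a memory layout behind a fixed access interface can be illustrated with a minimal host-side C++ sketch. It is not the generated library described above: the names `ParticleArray`, `Layout`, `Field`, `offset`, and `sum_x` are hypothetical, and only two layouts (array of structures and structure of arrays) are assumed. Both layouts are compiled in, and the choice is made at run time, so application code written against `at()` does not need to be recompiled when the layout changes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record type: one element with three float fields.
enum Field : std::size_t { X = 0, Y = 1, Z = 2, kNumFields = 3 };

// The two layouts assumed for this sketch.
enum class Layout { AoS, SoA };

// Maps (element index i, field f) to a flat offset in an array of n elements.
inline std::size_t offset(Layout layout, std::size_t n, std::size_t i, Field f) {
    return layout == Layout::AoS
               ? i * kNumFields + f  // x0 y0 z0 x1 y1 z1 ...
               : f * n + i;          // x0 x1 ... y0 y1 ... z0 z1 ...
}

// Accessor that hides the layout from the application code.
class ParticleArray {
public:
    ParticleArray(std::size_t n, Layout layout)
        : n_(n), layout_(layout), data_(n * kNumFields, 0.0f) {}

    float& at(std::size_t i, Field f) { return data_[offset(layout_, n_, i, f)]; }
    float at(std::size_t i, Field f) const { return data_[offset(layout_, n_, i, f)]; }

    std::size_t size() const { return n_; }
    const std::vector<float>& raw() const { return data_; }

private:
    std::size_t n_;
    Layout layout_;            // chosen at run time, e.g. by an auto-tuner
    std::vector<float> data_;  // flat storage shared by both layouts
};

// Application code written once against the accessor; it is layout-agnostic.
inline float sum_x(const ParticleArray& p) {
    float s = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) s += p.at(i, X);
    return s;
}
```

On a GPU, the structure-of-arrays layout typically lets neighboring threads read neighboring addresses, which is one reason the preferred layout can differ between kernels and hardware generations. A tuner in this sketch would only need to pass a different `Layout` value to the constructor.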
