A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the use of registers and shared memory. Unlike earlier methods, which implement circular queues predominantly in indirectly addressable shared memory, our hybrid method exploits a new reuse pattern that spans multiple time steps of a stencil computation, so that circular queues can be implemented effectively in a balanced combination of shared memory and registers. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93X over methods that use circular queues implemented with shared memory only.
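The abstract describes the idea only at a high level; the minimal CUDA sketch below (our own illustration under stated assumptions, not the paper's implementation) shows the basic register/shared-memory split of a circular queue for a single time step of a 7-point 3D stencil. Each thread streams along z, keeping the z-1 and z+1 values of its own column in registers, while the current plane, whose x/y neighbours must be exchanged between threads, is staged in shared memory. The kernel name, tile sizes, and coefficients are hypothetical; the paper's hybrid method additionally extends this kind of reuse across multiple time steps and chooses the register/shared-memory placement automatically.

```cuda
// Minimal single-time-step sketch of a circular queue for a 7-point
// 3D stencil. Tile sizes BX, BY and coefficients c0, c1 are
// illustrative assumptions, not values from the paper.
#define BX 16
#define BY 16

__global__ void stencil7(const float* __restrict__ in,
                         float* __restrict__ out,
                         int nx, int ny, int nz,
                         float c0, float c1)
{
    // Shared-memory part of the queue: the current z-plane, whose
    // x/y neighbours must be visible to other threads in the block.
    __shared__ float cur[BY + 2][BX + 2];

    // For brevity, assume nx and ny are multiples of BX and BY.
    int ix = blockIdx.x * BX + threadIdx.x;
    int iy = blockIdx.y * BY + threadIdx.y;
    int tx = threadIdx.x + 1;                 // position inside the padded tile
    int ty = threadIdx.y + 1;

    size_t stride = (size_t)nx * ny;          // distance between z-planes
    size_t idx = (size_t)iy * nx + ix;        // element (ix, iy) in plane z = 0

    // Register part of the queue: the planes that only this thread
    // needs (z-1 and z+1 along its own column).
    float behind = 0.0f;
    float center = in[idx];                   // plane z = 0
    float ahead  = in[idx + stride];          // plane z = 1

    for (int z = 1; z < nz - 1; ++z) {
        idx += stride;                        // idx now points into plane z

        // Rotate the queue: registers shift by one plane, one new plane loads.
        behind = center;
        center = ahead;
        ahead  = in[idx + stride];

        __syncthreads();                      // previous plane fully consumed
        cur[ty][tx] = center;
        // Halo cells for the in-plane (x/y) neighbours.
        if (threadIdx.x == 0      && ix > 0)      cur[ty][0]      = in[idx - 1];
        if (threadIdx.x == BX - 1 && ix < nx - 1) cur[ty][BX + 1] = in[idx + 1];
        if (threadIdx.y == 0      && iy > 0)      cur[0][tx]      = in[idx - nx];
        if (threadIdx.y == BY - 1 && iy < ny - 1) cur[BY + 1][tx] = in[idx + nx];
        __syncthreads();                      // current plane fully published

        // Interior update: z-neighbours come from registers,
        // x/y-neighbours from shared memory.
        if (ix > 0 && ix < nx - 1 && iy > 0 && iy < ny - 1) {
            out[idx] = c0 * center
                     + c1 * (behind + ahead
                             + cur[ty][tx - 1] + cur[ty][tx + 1]
                             + cur[ty - 1][tx] + cur[ty + 1][tx]);
        }
    }
}
```

The sketch illustrates the trade-off the paper's framework tunes: values private to a thread can live in registers because they need no indirect addressing, but keeping more of the queue in registers raises register pressure, while pushing more of it into shared memory raises shared-memory traffic and synchronization cost.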
