A versatile software systolic execution model for GPU memory-bound kernels

This paper proposes a versatile, high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model in which partial sums are shifted across threads using CUDA warp primitives. We also use the register file as a cache so that the entire model operates efficiently. We demonstrate the effectiveness and versatility of the proposed model on a wide variety of stencil kernels that commonly appear in HPC, as well as on convolution kernels, which are increasingly important in deep learning workloads. Our algorithm outperforms the best reported state-of-the-art stencil implementations, including those that use sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution with general filter sizes and shapes, our algorithm is on average 2.5× faster than Nvidia's NPP on V100 and P100 GPUs.
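
The idea of shifting partial sums between the lanes of a warp can be illustrated with a small emulation. The sketch below is not the paper's implementation: it is plain Python that imitates one 32-lane warp computing a 1D stencil, under the assumption that each lane keeps one input value in a register and that a shift behaves like CUDA's `__shfl_up_sync` (lanes below the shift distance keep their own value). The names `shfl_up`, `systolic_stencil_1d`, and `reference_stencil_1d` are hypothetical and exist only for this illustration.

```python
WARP_SIZE = 32  # number of lanes in one CUDA warp


def shfl_up(values, delta):
    """Emulate a warp shift: lane i reads the value held by lane i - delta.

    Lanes with i < delta have no source lane and keep their own value,
    which mirrors the behavior of CUDA's __shfl_up_sync.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def systolic_stencil_1d(x, weights):
    """Compute a 1D stencil over one warp by shifting partial sums.

    x holds one input element per lane (the emulated "register cache").
    At each step every lane adds weights[step] * (its own element) to the
    partial sum it currently holds, then passes that sum to the next lane.
    """
    assert len(x) == WARP_SIZE
    k = len(weights)
    partial = [0] * WARP_SIZE
    for step, w in enumerate(weights):
        partial = [p + w * xi for p, xi in zip(partial, x)]
        if step < k - 1:
            partial = shfl_up(partial, 1)
    # The first k - 1 lanes lack left neighbors, so their sums are invalid.
    return partial[k - 1:]


def reference_stencil_1d(x, weights):
    """Direct evaluation of the same stencil, used only as a check."""
    k = len(weights)
    return [sum(weights[j] * x[i - (k - 1) + j] for j in range(k))
            for i in range(k - 1, len(x))]


if __name__ == "__main__":
    data = list(range(WARP_SIZE))
    taps = [1, 2, 1]
    assert systolic_stencil_1d(data, taps) == reference_stencil_1d(data, taps)
```

In this emulation each lane reads its input element once and reuses it for every filter tap, while only the partial sums move between lanes. That is the property that makes the register file usable as a cache in such a scheme. On a real GPU the loop body would be one multiply-add plus one warp shuffle per tap.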
