Paper information - A versatile software systolic execution model for GPU memory-bound kernels


This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums using CUDA warp primitives to carry out the computation. We also use the register file as a cache so that the entire model operates efficiently. We demonstrate the effectiveness and versatility of the proposed model on a wide variety of stencil kernels that commonly appear in HPC, as well as on convolution kernels, which are increasingly important in deep learning workloads. Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest NVIDIA architectures: Tesla V100 and P100. For 2D convolution with general filter sizes and shapes, our algorithm is on average 2.5× faster than NVIDIA's NPP on V100 and P100 GPUs.
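To make the idea of shifting partial sums concrete, here is a minimal pure-Python emulation of a single 32-lane warp computing a 1D stencil. It is a sketch under stated assumptions, not the authors' implementation: the helper `shfl_up` only mimics the semantics of CUDA's `__shfl_up_sync`, and the names `systolic_stencil_1d` and `reference_stencil_1d` are hypothetical. Each lane keeps its input element in a "register" (a list slot), adds its weighted contribution to a partial sum, and passes that partial sum to the next lane.

```python
WARP_SIZE = 32  # number of lanes in one emulated warp


def shfl_up(values, delta):
    """Emulate the semantics of CUDA's __shfl_up_sync: lane i reads the value
    held by lane i - delta; lanes with i < delta keep their own value."""
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def systolic_stencil_1d(x, coeffs):
    """Hypothetical sketch: compute out[j] = sum_k coeffs[k] * x[j + k]
    by shifting partial sums across lanes instead of re-reading inputs."""
    assert len(x) == WARP_SIZE
    partial = [0.0] * WARP_SIZE  # one partial sum per lane
    for k, c in enumerate(coeffs):
        if k > 0:
            # Pass every partial sum one lane up before the next accumulation.
            partial = shfl_up(partial, 1)
        # Each lane adds its own register-resident input, scaled by coeffs[k].
        partial = [p + c * xi for p, xi in zip(partial, x)]
    # Lanes below len(coeffs) - 1 never received a full chain of contributions.
    return partial[len(coeffs) - 1:]


def reference_stencil_1d(x, coeffs):
    """Direct evaluation of the same stencil, used only for comparison."""
    k_len = len(coeffs)
    return [sum(coeffs[k] * x[j + k] for k in range(k_len))
            for j in range(len(x) - k_len + 1)]


if __name__ == "__main__":
    data = [float(i) for i in range(WARP_SIZE)]
    weights = [1.0, 2.0, 3.0]
    assert systolic_stencil_1d(data, weights) == reference_stencil_1d(data, weights)
```

In this emulation every input element is read once and then stays in its lane, while only the partial sums move. On a real GPU the same pattern would be written with warp shuffle intrinsics inside a kernel, and the halo lanes that produce incomplete sums would need to be handled by overlapping neighbouring warps.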
