Systolic Computing on GPUs for Productive Performance

We propose a language and compiler to productively build high-performance {\it software systolic arrays} that run on GPUs. Based on a rigorous mathematical foundation (uniform recurrence equations and space-time transforms), our language has a high level of abstraction and covers a wide range of applications. A programmer {\it specifies} a projection of a dataflow computation onto a linear systolic array and leaves the detailed implementation to the compiler; the compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of the GPU. In this way, productivity and performance are achieved at the same time. The approach combines loop transformations, data shuffling, and vector register allocation in a single framework. Many other optimizations can be applied as well, and the compiler composes them to generate efficient code. We implemented the approach on Intel GPUs. This is the first system that allows productive construction of systolic arrays on GPUs. We allow multiple projections, arbitrary projection directions, and linear schedules, which together can express most, if not all, systolic arrays used in practice. Experiments with 1-D and 2-D convolution on an Intel GEN9.5 GPU demonstrate the generality of the approach and its productivity in expressing various systolic designs in the search for the best candidate. Although our systolic arrays are implemented purely in software on generic SIMD hardware, some of our best designs are up to 59\% faster than the GPU's specialized hardware samplers performing the same convolutions. Overall, this approach holds promise for productive high-performance computing on GPUs.
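To make the idea of a space-time transform concrete, the following is a minimal Python sketch, not the language or compiler described above. It assumes a 1-D convolution y[i] = sum over k of w[k] * x[i + k], a projection along the i axis so that iteration (i, k) runs on processing element p = k, and a linear schedule t = i + k. The function names `conv1d_reference` and `conv1d_systolic` are hypothetical and chosen only for illustration.

```python
def conv1d_reference(x, w):
    """Direct 1-D convolution (no kernel flip): y[i] = sum_k w[k] * x[i + k]."""
    n_out = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n_out)]


def conv1d_systolic(x, w):
    """Simulate a linear systolic array obtained by a space-time transform.

    Assumed mapping (illustrative only):
      space: p = k       (projection along the i axis; weight w[p] stays in PE p)
      time:  t = i + k   (linear schedule)
    """
    n_pe = len(w)
    n_out = len(x) - n_pe + 1
    y = [0] * n_out
    last_step = (n_out - 1) + (n_pe - 1)

    for t in range(last_step + 1):
        # In a real array all PEs work in parallel within one time step;
        # this inner loop only simulates that step sequentially.
        for p in range(n_pe):
            i = t - p  # invert the schedule: which output does PE p serve now?
            if 0 <= i < n_out:
                # Since i + p == t, every active PE reads the same input x[t],
                # while the partial sum y[i] moves from PE p to PE p + 1.
                y[i] += w[p] * x[i + p]
    return y


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
print(conv1d_systolic(x, w))  # same values as conv1d_reference(x, w)
```

Under this particular choice the weights are stationary, the input is broadcast, and the partial sums flow between neighboring processing elements. A different projection direction or schedule yields a different systolic design for the same computation, which is the kind of design space the abstract refers to.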
