A versatile software systolic execution model for GPU memory-bound kernels
Satoshi Matsuoka | Mohamed Wahib | Ryousei Takano | Peng Chen | Shinichiro Takizawa
[1] Benjamin W. Wah,et al. The Design of Optimal Systolic Arrays , 1985, IEEE Transactions on Computers.
[2] Samuel Williams,et al. Compiler-Directed Transformation for Higher-Order Stencils , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[3] Mingyu Chen,et al. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning , 2017, PPoPP.
[4] Xinxin Mei,et al. Benchmarking the Memory Hierarchy of Modern GPUs , 2014, NPC.
[5] Gerhard W. Zumbusch. Vectorized Higher Order Finite Difference Kernels , 2012, PARA.
[6] Francky Catthoor,et al. Polyhedral parallel code generation for CUDA , 2013, TACO.
[7] Satoshi Matsuoka,et al. Efficient Algorithms for the Summed Area Tables Primitive on GPUs , 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[8] Naoya Maruyama,et al. Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .
[9] Patrice Quinton,et al. The systematic design of systolic arrays , 1987 .
[10] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[11] M. Rubinoff,et al. Numerical solution of differential equations , 1954, AIEE-IRE '54 (Eastern).
[12] Torsten Hoefler,et al. Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs , 2019, ArXiv.
[13] P. Kloeden,et al. Numerical Solution of Stochastic Differential Equations , 1992 .
[14] Stephen John Turner,et al. Optimizing and Auto-Tuning Iterative Stencil Loops for GPUs with the In-Plane Method , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[15] Mark J. Harris,et al. Parallel Prefix Sum (Scan) with CUDA , 2011 .
[16] H. T. Kung. Let's Design Algorithms for VLSI Systems , 1979 .
[17] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] Paulius Micikevicius,et al. 3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.
[19] Marco Maggioni,et al. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking , 2018, ArXiv.
[20] Kurt Keutzer,et al. A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[21] James Demmel,et al. the Parallel Computing Landscape , 2022 .
[22] David E. Keyes,et al. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates , 2014, SIAM J. Sci. Comput..
[23] André Seznec,et al. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[24] Trieu-Kien Truong,et al. Systolic Multipliers for Finite Fields GF(2^m) , 1984, IEEE Transactions on Computers.
[25] Torsten Hoefler,et al. Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures , 2019, SC.
[26] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[27] Weifeng Liu,et al. Fast segmented sort on GPUs , 2017, ICS.
[28] John Tran,et al. cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.
[29] John D. Owens,et al. Register packing for cyclic reduction: a case study , 2011, GPGPU-4.
[30] Marco Maggioni,et al. Dissecting the NVidia Turing T4 GPU via Microbenchmarking , 2019, ArXiv.
[31] Tuning CUDA Applications for Volta , 2018 .
[32] Satoshi Matsuoka,et al. Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL , 2018, FPGA.
[33] Guangwei Zhang,et al. Modeling the Performance of 2.5D Blocking of 3D Stencil Code on GPUs , 2016 .
[34] P. Sadayappan,et al. Associative Instruction Reordering to Alleviate Register Pressure , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[35] Pablo Enfedaque,et al. Implementation of the DWT in a GPU through a Register-based Strategy , 2015, IEEE Transactions on Parallel and Distributed Systems.
[36] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.
[37] H. T. Kung. Why systolic architectures? , 1982, Computer.
[38] Yang Ling. The numerical solution of stochastic differential equation of ito-volterra type drived by wiener process , 2006 .
[39] Shan Huang,et al. Tessellating Stencils , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[40] Keshab K. Parhi,et al. A novel systolic array structure for DCT , 2005, IEEE Transactions on Circuits and Systems II: Express Briefs.
[41] Eli Ben-Sasson,et al. Fast Multiplication in Binary Fields on GPUs via Register Cache , 2016, ICS.
[42] Martin Griebl,et al. Code generation in the polytope model , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).
[43] Albert Cohen,et al. The Polyhedral Model Is More Widely Applicable Than You Think , 2010, CC.
[44] Shuaiwen Song,et al. Warp-Consolidation: A Novel Execution Model for GPUs , 2018, ICS.
[45] Vincent Loechner. PolyLib: A Library for Manipulating Parameterized Polyhedra , 1999 .
[46] Rudolf Eigenmann,et al. RegDem: Increasing GPU Performance via Shared Memory Register Spilling , 2019, ArXiv.
[47] James Reinders,et al. Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .
[48] Harold S. Stone,et al. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations , 1973, IEEE Transactions on Computers.
[49] Dan I. Moldovan,et al. ADVIS: A Software Package for the Design of Systolic Arrays , 1987, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[50] Jonathan Ragan-Kelley,et al. Automatically scheduling halide image processing pipelines , 2016, ACM Trans. Graph..
[51] Moon Ho Lee,et al. Simple systolic arrays for discrete cosine transform , 1990, Multidimens. Syst. Signal Process..
[52] Samuel Williams,et al. Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks , 2018, 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC).
[53] John V. McCanny,et al. Optimised Bit Level Systolic Array for Convolution , 1984 .
[54] Viktor K. Prasanna,et al. On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication , 1991, IEEE Trans. Computers.
[55] Himanshu Bhatnagar. Advanced ASIC Chip Synthesis , 1999 .
[56] P. Sadayappan,et al. Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations , 2018, Proceedings of the IEEE.
[57] Doran Wilde,et al. A Library for Doing Polyhedral Operations , 2000 .