Accelerating High-Order Stencils on GPUs
John Mellor-Crummey | Mauricio Araya-Polo | Xiaozhu Meng | Jie Meng | Ryuichi Sai