Towards Accelerating High-Order Stencils on Modern GPUs and Emerging Architectures with a Portable Framework
[1] M. Araya-Polo,et al. Massively Distributed Finite-Volume Flux Computation , 2023, SC Workshops.
[2] M. Araya-Polo,et al. Scalable Distributed High-Order Stencil Computations , 2022, SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] John M. Mellor-Crummey,et al. Using the Semi-Stencil Algorithm to Accelerate High-Order Stencils on GPUs , 2021, 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).
[4] John Mellor-Crummey,et al. Accelerating High-Order Stencils on GPUs , 2020, 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).
[5] Yuanming Hu,et al. The Taichi programming language , 2020, SIGGRAPH Courses.
[6] Henri Calandra,et al. Minimod: A Finite Difference solver for Seismic Modeling , 2020, ArXiv.
[7] Jaime Fernández del Río,et al. Array programming with NumPy , 2020, Nature.
[8] Torsten Hoefler,et al. Domain-Specific Multi-Level IR Rewriting for GPU , 2020, ACM Trans. Archit. Code Optim.
[9] Uday Bondhugula,et al. MLIR: A Compiler Infrastructure for the End of Moore's Law , 2020, ArXiv.
[10] Mohamed Wahib,et al. AN5D: automated stencil framework for high-degree temporal blocking on GPUs , 2020, CGO.
[11] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[12] Ulrich Rüde,et al. Code generation for massively parallel phase-field simulations , 2019, SC.
[13] P. Sadayappan,et al. On Optimizing Complex Stencils on GPUs , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[14] P. Sadayappan,et al. Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations , 2018, Proceedings of the IEEE.
[15] Felix J. Herrmann,et al. Devito: an embedded domain-specific language for finite differences and geophysical exploration , 2018, Geoscientific Model Development.
[16] Philipp A. Witte,et al. Architecture and Performance of Devito, a System for Automated Stencil Computation , 2018, ACM Trans. Math. Softw.
[17] Shoaib Kamil,et al. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code , 2018, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[18] P. Sadayappan,et al. Register optimizations for stencils on GPUs , 2018, PPoPP.
[19] Michel Steuwer,et al. LIFT: A functional data-parallel IR for high-performance GPU code generation , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[20] Stefan Bilbao,et al. Large Stencil Operations for GPU-based 3-D Acoustics Simulations , 2015.
[21] J. Ramanujam,et al. SDSLc: a multi-target domain-specific compiler for stencil computations , 2015, WOLFHPC@SC.
[22] Elnar Hajiyev,et al. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[23] Tobias Gysi,et al. Towards a performance portable, architecture agnostic implementation strategy for weather and climate models , 2014, Supercomput. Front. Innov.
[24] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI.
[25] Albert Cohen,et al. Split tiling for GPUs: automatic parallelization using trapezoidal tiles , 2013, GPGPU@ASPLOS.
[26] Uday Bondhugula,et al. Tiling stencil computations to maximize parallelism , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[27] P. Sadayappan,et al. High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.
[28] Bradley C. Kuszmaul,et al. The Pochoir stencil compiler , 2011, SPAA '11.
[29] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[30] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[31] Hans-Peter Seidel,et al. Cache oblivious parallelograms in iterative stencil computations , 2010, ICS '10.
[32] José María Cela,et al. Introducing the Semi-stencil Algorithm , 2009, PPAM.
[33] Volker Strumpen,et al. The Cache Complexity of Multithreaded Cache Oblivious Algorithms , 2009, SPAA '06.
[34] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[35] Paulius Micikevicius,et al. 3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.
[36] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[37] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.
[38] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[39] Jeroen Tromp,et al. A perfectly matched layer absorbing boundary condition for the second-order seismic wave equation , 2003.
[40] David G. Wonnacott,et al. Achieving Scalable Locality with Time Skewing , 2002, International Journal of Parallel Programming.
[41] Guohua Jin,et al. Increasing Temporal Locality with Skewing and Recursive Blocking , 2001, ACM/IEEE SC 2001 Conference (SC'01).
[42] David G. Wonnacott,et al. Using time skewing to eliminate idle time due to memory bandwidth and network limitations , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.
[43] Matteo Frigo,et al. Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).
[44] Zhiyuan Li,et al. New tiling techniques to improve cache temporal locality , 1999, PLDI '99.
[45] Michael Isard,et al. A functional pattern-based language in MLIR , 2020.
[46] Raúl de la Cruz,et al. Algorithm 942: Semi-Stencil , 2014, ACM Trans. Math. Softw.
[47] John D. McCalpin,et al. Time Skewing: A Value-Based Approach to Optimizing for Memory Locality , 1999.