Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations
P. Sadayappan | Aravind Sukumaran-Rajam | Atanas Rountev | Louis-Noël Pouchet | Vinod Grover | Prashant Singh Rawat | Mahesh Ravishankar | Miheer Vaidya
[1] Bradley C. Kuszmaul,et al. The Pochoir stencil compiler , 2011, SPAA '11.
[2] P. Sadayappan,et al. High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.
[3] Mohamed Wahib,et al. Scalable Kernel Fusion for Memory-Bound GPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[4] Alejandro Duran,et al. YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning , 2016, 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC).
[5] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[6] Albert Cohen,et al. Split tiling for GPUs: automatic parallelization using trapezoidal tiles , 2013, GPGPU@ASPLOS.
[7] A Thesis,et al. Tiling Stencil Computations to Maximize Parallelism , 2013 .
[8] Paulius Micikevicius,et al. Fusing convolution kernels through tiling , 2015, ARRAY@PLDI.
[9] Kevin Skadron,et al. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.
[10] J. Ramanujam,et al. A framework for enhancing data reuse via associative reordering , 2014, PLDI.
[11] Uday Bondhugula,et al. Loop transformations: convexity, pruning and optimization , 2011, POPL '11.
[12] Sven Verdoolaege,et al. isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.
[13] Xing Zhou,et al. Hierarchical overlapped tiling , 2012, CGO '12.
[14] Naoya Maruyama,et al. Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .
[15] Mario A. R. Dantas,et al. Extending OpenACC for Efficient Stencil Code Generation and Execution by Skeleton Frameworks , 2017, 2017 International Conference on High Performance Computing & Simulation (HPCS).
[16] Mark F. Adams,et al. Chombo Software Package for AMR Applications Design Document , 2014 .
[17] Uday Bondhugula,et al. PolyMage: Automatic Optimization for Image Processing Pipelines , 2015, ASPLOS.
[18] P. Sadayappan,et al. Resource conscious reuse-driven tiling for GPUs , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[19] Jacqueline Chame,et al. A script-based autotuning compiler system to generate high-performance CUDA code , 2013, TACO.
[20] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[21] Tobias Gysi,et al. STELLA: a domain-specific tool for structured grid methods in weather and climate models , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[22] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..
[23] Francky Catthoor,et al. Polyhedral parallel code generation for CUDA , 2013, TACO.
[24] Albert Cohen,et al. Hybrid Hexagonal/Classical Tiling for GPUs , 2014, CGO '14.
[25] Richard Veras,et al. A stencil compiler for short-vector SIMD architectures , 2013, ICS '13.
[26] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[27] Samuel Williams,et al. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers , 2017, Parallel Comput..
[28] Vinod Grover,et al. Forma: a DSL for image processing applications to target GPUs and multi-core CPUs , 2015, GPGPU@PPoPP.
[29] Catherine Mills Olschanowsky,et al. A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[30] Catherine Mills Olschanowsky,et al. Transforming loop chains via macro dataflow graphs , 2018, CGO.
[31] P. Sadayappan,et al. Effective resource management for enhancing performance of 2D and 3D stencils on GPUs , 2016, GPGPU@PPoPP.
[32] Cosmin Nita,et al. Optimized three-dimensional stencil computation on Fermi and Kepler GPUs , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).
[33] Helmar Burkhart,et al. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[34] Sanjay V. Rajopadhye,et al. Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils , 2017, PPoPP.
[35] Scott B. Baden,et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.
[36] Satoshi Matsuoka,et al. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[37] D. Quinlan,et al. ROSE: Compiler Support for Object-Oriented Frameworks , 1999 .
[38] Peter Messmer,et al. Parallel data-locality aware stencil computations on modern micro-architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[39] Jonathan Ragan-Kelley,et al. Automatically scheduling Halide image processing pipelines , 2016, ACM Trans. Graph..
[40] Frank Mueller,et al. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters , 2012, CGO '12.
[41] Paulius Micikevicius,et al. 3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.
[42] Shoaib Kamil,et al. OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[43] Torsten Hoefler,et al. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures , 2015, ICS.
[44] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.