Performance Limits Study of Stencil Codes on Modern GPGPUs
[1] Alejandro Duran, et al. Effective Use of Large High-Bandwidth Memory Caches in HPC Stencil Computation via Temporal Wave-Front Tiling, 2016, 2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).
[2] Sanjay V. Rajopadhye, et al. Generation of Efficient Nested Loops from Polyhedra, 2000, International Journal of Parallel Programming.
[3] Naoya Maruyama, et al. Optimizing Stencil Computations for NVIDIA Kepler GPUs, 2014.
[4] Hikaru Inoue, et al. Automatic generation of efficient codes from mathematical descriptions of stencil computation, 2016, FHPC@ICFP.
[5] Paulius Micikevicius, et al. 3D finite difference computation on GPUs using CUDA, 2009, GPGPU-2.
[6] Marco Maggioni, et al. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, 2018, ArXiv.
[7] Hao Wang, et al. GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs, 2017, Conf. Computing Frontiers.
[8] Samuel Williams, et al. Roofline: an insightful visual performance model for multicore architectures, 2009, CACM.
[9] B. Fornberg. Generation of finite difference formulas on arbitrarily spaced grids, 1988.
[10] Chau-Wen Tseng, et al. Tiling Optimizations for 3D Scientific Computations, 2000, ACM/IEEE SC 2000 Conference (SC'00).
[11] Monica S. Lam, et al. A data locality optimizing algorithm, 1991, PLDI '91.
[12] Lorena A. Barba, et al. How Will the Fast Multipole Method Fare in the Exascale Era, 2013.
[13] Marcin Dabrowski, et al. Efficient 3D stencil computations using CUDA, 2013, Parallel Comput.
[14] P. Sadayappan, et al. High-performance code generation for stencil computations on GPU architectures, 2012, ICS '12.
[15] Vadim D. Levchenko, et al. Detailed numerical simulation of shock-body interaction in 3D multicomponent flow using the RKDG numerical method and "DiamondTorre" GPU algorithm of implementation, 2016.
[16] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.
[17] Samuel Williams, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] Sergei Gorlatch, et al. High performance stencil code generation with Lift, 2018, CGO.
[19] Toshio Endo. Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memory Hierarchy, 2018, 2018 IEEE 7th Non-Volatile Memory Systems and Applications Symposium (NVMSA).
[20] Massimiliano Fatica, et al. Implementing the Himeno benchmark with CUDA on GPU clusters, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[21] Marco Maggioni, et al. Dissecting the NVidia Turing T4 GPU via Microbenchmarking, 2019, ArXiv.
[22] Takeshi Fukaya, et al. Time-space tiling with tile-level parallelism for the 3D FDTD method, 2018, HPC Asia.
[23] Junichiro Makino, et al. Optimal Temporal Blocking for Stencil Computation, 2015, ICCS.
[24] Danilo De Donno, et al. Introduction to GPU Computing and CUDA Programming: A Case Study on FDTD [EM Programmer's Notebook], 2010.
[25] Andrey Zakirov, et al. High performance FDTD algorithm for GPGPU supercomputers, 2016.
[26] Hans-Joachim Bungartz, et al. A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters, 2017, Comput.
[27] V. Levchenko, et al. Locally Recursive Non-Locally Asynchronous Algorithms for Stencil Computation, 2018.
[28] Pradeep Dubey, et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.