Overlapping Data Transfers with Computation on GPU with Tiles
John Shalf | Ann S. Almgren | Didem Unat | Weiqun Zhang | Burak Bastem
[1] Tianyi David Han, et al. Reducing branch divergence in GPU programs, 2011, GPGPU-4.
[2] Nathan Bell, et al. Thrust: A Productivity-Oriented Library for CUDA, 2012.
[3] Scott B. Baden, et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C, 2011, ICS '11.
[4] Sergei Gorlatch, et al. Programming GPUs with C++14 and Just-In-Time Compilation, 2015, PARCO.
[5] Torsten Hoefler, et al. dCUDA: Hardware Supported Overlap of Computation and Communication, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[6] Scott B. Baden, et al. A new approach to interactive viewpoint selection for volume data sets, 2013, Inf. Vis.
[7] Jun Zhou, et al. Hands-on Performance Tuning of 3D Finite Difference Earthquake Simulation on GPU Fermi Chipset, 2012, ICCS.
[8] Richard D. Hornung, et al. The RAJA Portability Layer: Overview and Status, 2014.
[9] Samuel Williams, et al. ExaSAT: An exascale co-design tool for performance modeling, 2015, Int. J. High Perform. Comput. Appl.
[10] Laxmi N. Bhuyan, et al. CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs, 2016, ICS.
[11] Mauro Bianco, et al. A Generic Strategy for Multi-stage Stencils, 2014, Euro-Par.
[12] Mohamed Wahib, et al. Daino: A High-Level Framework for Parallel and Efficient AMR on GPUs, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[13] John Shalf, et al. Trends in Data Locality Abstractions for HPC Systems, 2017, IEEE Transactions on Parallel and Distributed Systems.
[14] Satoshi Matsuoka, et al. CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application, 2013, 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing.
[15] Paulius Micikevicius, et al. 3D finite difference computation on GPUs using CUDA, 2009, GPGPU-2.
[16] Samuel Williams, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Rudolf Eigenmann, et al. OpenMPC: Extended OpenMP Programming and Tuning for GPUs, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] Yong-Jun Lee, et al. Translating OpenMP Device Constructs to OpenCL Using Unnecessary Data Transfer Elimination, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Mitsuhisa Sato, et al. XcalableACC: Extension of XcalableMP PGAS Language Using OpenACC for Accelerator Clusters, 2014, 2014 First Workshop on Accelerator Programming using Directives.
[20] Michael Wolfe, et al. Implementing the PGI Accelerator model, 2010, GPGPU-3.
[21] John Shalf, et al. TiDA: High-Level Programming Abstractions for Data Locality Management, 2016, ISC.
[22] John Shalf, et al. BoxLib with Tiling: An Adaptive Mesh Refinement Software Framework, 2016, SIAM J. Sci. Comput.
[23] Chau-Wen Tseng, et al. Tiling Optimizations for 3D Scientific Computations, 2000, ACM/IEEE SC 2000 Conference (SC'00).
[24] Jacqueline Chame, et al. A script-based autotuning compiler system to generate high-performance CUDA code, 2013, TACO.
[25] Albert Cohen, et al. Split tiling for GPUs: automatic parallelization using trapezoidal tiles, 2013, GPGPU@ASPLOS.
[26] David A. Padua, et al. Programming for parallelism and locality with hierarchically tiled arrays, 2006, PPoPP '06.
[27] H. Carter Edwards, et al. Kokkos: Enabling Performance Portability Across Manycore Architectures, 2013, 2013 Extreme Scaling Workshop (xsw 2013).
[28] Ade Miller, et al. C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++, 2012.
[29] Ronan Keryell, et al. Khronos SYCL for OpenCL: a tutorial, 2015, IWOCL.
[30] Zehra Sura, et al. Towards Performance Portable GPU Programming with RAJA [Extended Abstract], 2015.