Efficient Fork-Join on GPUs Through Warp Specialization
Tian Jin | Kevin O'Brien | Zehra Sura | Tong Chen | Carlo Bertolli | Alexandre E. Eichenberger | Gheorghe-Teodor Bercea | Georgios Rokos | Hyojin Sung | Arpith Chacko Jacob | Samuel Antão | Alexey Bataev
[1] Brucek Khailany, et al. CudaDMA: Optimizing GPU memory bandwidth via warp specialization, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[2] Yi Yang, et al. CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications, 2015, Journal of Computer Science and Technology.
[3] Rudolf Eigenmann, et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization, 2009, PPoPP '09.
[4] Sunita Chandrasekaran, et al. Compiling a High-Level Directive-Based Programming Model for GPGPUs, 2013, LCPC.
[5] Francisco de Sande, et al. accULL: An OpenACC Implementation with CUDA and OpenCL Support, 2012, Euro-Par.
[6] Kevin O'Brien, et al. Coordinating GPU Threads for OpenMP 4.0 in LLVM, 2014, 2014 LLVM Compiler Infrastructure in HPC.
[7] Bronis R. de Supinski, et al. Early Experiences with the OpenMP Accelerator Model, 2013, IWOMP.
[8] J. M. Bull, et al. Measuring Synchronisation and Scheduling Overheads in OpenMP, 2007.
[9] Wu-chun Feng, et al. Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures, 2015, ICPE.
[10] Yong-Jun Lee, et al. Translating OpenMP Device Constructs to OpenCL Using Unnecessary Data Transfer Elimination, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[12] Barbara M. Chapman, et al. Assessing One-to-One Parallelism Levels Mapping for OpenMP Offloading to GPUs, 2017, PMAM@PPoPP.
[13] José Ignacio Benavides Benítez, et al. An optimized approach to histogram computation on GPU, 2012, Machine Vision and Applications.
[14] Seyong Lee, et al. OpenARC: Extensible OpenACC Compiler Framework for Directive-Based Accelerator Programming Study, 2014, 2014 First Workshop on Accelerator Programming using Directives.
[15] Kevin Skadron, et al. Scalable parallel programming, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[16] Vikram S. Adve, et al. LLVM: a compilation framework for lifelong program analysis & transformation, 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.