Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping
Xipeng Shen | Eddy Z. Zhang | Ziyu Guo | Yunlian Jiang
[1] Satoshi Matsuoka, et al. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[2] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.
[3] Ye Zhao, et al. Lattice Boltzmann based PDE solver on the GPU, 2008, The Visual Computer.
[4] Trishul M. Chilimbi, et al. Cache-conscious coallocation of hot data streams, 2006, PLDI '06.
[5] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, HiPC 2008.
[6] Xiaoming Li, et al. A control-structure splitting optimization for GPGPU, 2009, CF '09.
[7] Tor M. Aamodt, et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[8] Uday Bondhugula, et al. A compiler framework for optimization of affine loop nests for GPGPUs, 2008, ICS '08.
[9] Kevin Skadron, et al. Scalable parallel programming, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[10] Chau-Wen Tseng, et al. Exploiting locality for irregular scientific codes, 2006, IEEE Transactions on Parallel and Distributed Systems.
[11] Rudolf Eigenmann, et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization, 2009, PPoPP '09.
[12] Wen-mei W. Hwu, et al. Program optimization space pruning for a multithreaded GPU, 2008, CGO '08.
[13] Ken Kennedy, et al. Improving effective bandwidth through compiler enhancement of global cache reuse, 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[14] Toshio Endo, et al. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA, 2008, HiPC 2008.
[15] Ken Kennedy, et al. Improving cache performance in dynamic applications through data and computation reorganization at run time, 1999, PLDI '99.
[16] Xipeng Shen, et al. A cross-input adaptive framework for GPU program optimizations, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[17] Wen-mei W. Hwu, et al. CUDA-Lite: Reducing GPU Programming Complexity, 2008, LCPC.
[18] Dorit S. Hochbaum, et al. Approximation Algorithms for NP-Hard Problems, 1996.
[19] Anjul Patney, et al. Efficient computation of sum-products on GPUs through software-managed cache, 2008, ICS '08.
[20] Naga K. Govindaraju, et al. Fast scan algorithms on graphics processors, 2008, ICS '08.