An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization
[1] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[2] Yun Liang,et al. An efficient compiler framework for cache bypassing on GPUs , 2013, ICCAD 2013.
[3] William Gropp,et al. An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.
[4] Yun Liang,et al. Efficient GPU Spatial-Temporal Multitasking , 2015, IEEE Transactions on Parallel and Distributed Systems.
[5] Yangdong Deng,et al. Taming irregular EDA applications on GPUs , 2009, 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers.
[6] Lin Ma,et al. Analysis of classic algorithms on GPUs , 2014, 2014 International Conference on High Performance Computing & Simulation (HPCS).
[7] Xipeng Shen,et al. Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.
[8] Dong Hyuk Woo,et al. SIMD divergence optimization through intra-warp compaction , 2013, ISCA.
[9] Shengkui Zhao,et al. Real-time implementation and performance optimization of 3D sound localization on GPUs , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[10] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[11] Shane Ryoo,et al. Program Optimization Strategies for Data-Parallel Many-Core Processors , 2008.
[12] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[13] Minh N. Do,et al. A revisit to cost aggregation in stereo matching: How far can we reduce its computational redundancy? , 2011, 2011 International Conference on Computer Vision.
[14] T. R. P. Siriwardena,et al. Accelerating global sequence alignment using CUDA compatible multi-core GPU , 2010, 2010 Fifth International Conference on Information and Automation for Sustainability.
[15] Mattan Erez,et al. Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation , 2013, ISCA.
[16] Yu Wang,et al. Run-time technique for simultaneous aging and power optimization in GPGPUs , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).
[17] Hiren D. Patel,et al. On the use of GP-GPUs for accelerating compute-intensive EDA applications , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[18] Hyesoon Kim,et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.
[19] Amitabh Varshney,et al. High-throughput sequence alignment using Graphics Processing Units , 2007, BMC Bioinformatics.
[20] Ali Akoglu,et al. Sequence alignment with GPU: Performance and design challenges , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[21] Lin Ma,et al. Theoretical analysis of classic algorithms on highly-threaded many-core GPUs , 2014, PPoPP '14.
[22] Yun Liang,et al. An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[23] Yi Yang,et al. A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.
[24] William E. Lorensen,et al. Marching cubes: A high resolution 3D surface construction algorithm , 1987, SIGGRAPH.
[25] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[26] Yu Wang,et al. Coordinated static and dynamic cache bypassing for GPUs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[27] William J. Dally,et al. Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[28] Krste Asanovic,et al. Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[29] Mattan Erez,et al. The dual-path execution model for efficient GPU control flow , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[30] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.
[31] Xipeng Shen,et al. On-the-fly elimination of dynamic irregularities for GPU computing , 2011, ASPLOS XVI.
[32] Yun Liang,et al. Register and thread structure optimization for GPUs , 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).
[33] Tor M. Aamodt,et al. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware , 2009, TACO.
[34] Dongrui Fan,et al. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).