Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
[1] Zhao Zhang, et al. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality, 2000, MICRO 33.
[2] Xipeng Shen, et al. Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping, 2010, ICS '10.
[3] Mattan Erez, et al. CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures, 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[4] Amitabh Varshney, et al. High-throughput sequence alignment using Graphics Processing Units, 2007, BMC Bioinformatics.
[5] Nicolas Brunie, et al. Simultaneous branch and warp interweaving for sustained GPU performance, 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[6] Thomas E. Anderson, et al. High speed switch scheduling for local area networks, 1992, ASPLOS V.
[7] Ken Kennedy, et al. Conversion of control dependence to data dependence, 1983, POPL '83.
[8] Tor M. Aamodt, et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[9] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[10] Matei Ripeanu, et al. Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Nancy M. Amato, et al. Quantifying the effectiveness of load balance algorithms, 2012, ICS '12.
[12] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[13] Kevin Skadron, et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance, 2010, ISCA.
[14] James E. Smith, et al. Vector instruction set support for conditional operations, 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[15] Onur Mutlu, et al. Improving GPU performance via large warps and two-level warp scheduling, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[16] Atilla Eryilmaz, et al. Exploring the Throughput Boundaries of Randomized Schedulers in Wireless Networks, 2012, IEEE/ACM Transactions on Networking.
[17] Donald F. Ferguson, et al. Microeconomic algorithms for load balancing in distributed computer systems, 1988, [1988] Proceedings. The 8th International Conference on Distributed.
[18] Norman P. Jouppi, et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[19] B. Ramakrishna Rau, et al. Pseudo-randomly interleaved memory, 1991, ISCA '91.
[20] W. Dally, et al. Efficient conditional operations for data-parallel architectures, 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.
[21] Steven S. Muchnick, et al. Advanced Compiler Design and Implementation, 1997.
[22] Richard M. Russell, et al. The CRAY-1 computer system, 1978, CACM.
[23] Tor M. Aamodt, et al. Thread block compaction for efficient SIMT control flow, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[24] Lei Liu, et al. A software memory partition approach for eliminating bank-level interference in multicore systems, 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[25] Ahmed Sameh, et al. The Illiac IV system, 1972.
[26] Sudhakar Yalamanchili, et al. SIMD re-convergence at thread frontiers, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[27] Paolo Giaccone, et al. Efficient Randomized Algorithms for Input-Queued Switch Scheduling, 2002, IEEE Micro.
[28] Wei-Fen Lin, et al. Reducing DRAM latencies with an integrated memory hierarchy design, 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.
[29] Sudhakar Yalamanchili, et al. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[30] Mattan Erez, et al. The dual-path execution model for efficient GPU control flow, 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[31] P. J. Narayanan, et al. Accelerating Large Graph Algorithms on the GPU Using CUDA, 2007, HiPC.
[32] Sudhakar Yalamanchili, et al. Characterization and Transformation of Unstructured Control Flow in GPU Applications, 2011.
[33] Kevin Skadron, et al. Increasing memory miss tolerance for SIMD cores, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.