Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency