A Case for a Flexible Scalar Unit in SIMT Architecture
Qunfeng Dong | Huiyang Zhou | Norman Rubin | Ping Xiang | Yi Yang | Mike Mantor | Lisa R. Hsu
[1] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[2] Christopher Hughes,et al. Speculative precomputation: long-range prefetching of delinquent loads , 2001, ISCA 2001.
[3] Mattan Erez,et al. Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation , 2013, ISCA.
[4] Kunle Olukotun,et al. Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.
[5] James E. Smith,et al. Vector instruction set support for conditional operations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[6] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[7] Brian Kingsbury,et al. Spert-II: A Vector Microprocessor System , 1996, Computer.
[8] Xipeng Shen,et al. Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.
[9] Naga K. Govindaraju,et al. High performance discrete Fourier transforms on graphics processors , 2008, HiPC 2008.
[10] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[11] Christopher Batten,et al. The Vector-Thread Architecture , 2004, ISCA 2004.
[12] Matthew Mattina,et al. Tarantula: a vector extension to the alpha architecture , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.
[13] Yi Yang,et al. CPU-assisted GPGPU on fused CPU-GPU architectures , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[14] David A. Patterson,et al. Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.
[15] Richard W. Vuduc,et al. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[16] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[17] Mateo Valero,et al. Out-of-order vector architectures , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[18] Rajeev Balasubramonian,et al. Dynamically allocating processor resources between nearby and distant ILP , 2001, ISCA 2001.
[19] Josep Torrellas,et al. Using a user-level memory thread for correlation prefetching , 2002, ISCA.
[20] Yi Yang,et al. Warp-level divergence in GPUs: Characterization, impact, and mitigation , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[21] Christopher Batten,et al. Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators , 2013, ACM Trans. Comput. Syst.
[22] Christoforos E. Kozyrakis,et al. Overcoming the limitations of conventional vector processors , 2003, ISCA '03.
[23] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[24] Mateo Valero,et al. Decoupled vector architectures , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.
[25] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.
[26] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[27] Donald Yeung,et al. Design and evaluation of compiler algorithms for pre-execution , 2002, ASPLOS X.
[28] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[29] Naga K. Govindaraju,et al. High performance discrete Fourier transforms on graphics processors , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[30] Hsien-Hsin S. Lee,et al. COMPASS: a programmable data prefetcher using idle GPU shaders , 2010, ASPLOS XV.
[31] Rudolf Eigenmann,et al. Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation , 2003, LCPC.
[32] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[33] Yao Zhang,et al. Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations , 2009, Euro-Par Workshops.
[34] Krste Asanovic,et al. Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).