Warp-aware trace scheduling for GPUs
暂无分享,去创建一个
Maurice Herlihy | Onur Mutlu | Thomas B. Jablin | James A. Jablin | O. Mutlu | T. Jablin | Maurice Herlihy
[1] Scott A. Mahlke,et al. Effective compiler support for predicated execution using the hyperblock , 1992, MICRO 25.
[2] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[3] Paolo Faraboschi,et al. Instruction scheduling for instruction level parallel processors , 2001, Proc. IEEE.
[4] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[5] Jie Cheng,et al. Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..
[6] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[7] Scott A. Mahlke,et al. The superblock: An effective technique for VLIW and superscalar compilation , 1993, The Journal of Supercomputing.
[8] Joseph A. Fisher,et al. Trace Scheduling: A Technique for Global Microcode Compaction , 1981, IEEE Transactions on Computers.
[9] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.
[10] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[11] John R. Ellis,et al. Bulldog: a compiler for vliw architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific) , 1985 .
[12] Fernando Magno Quintão Pereira,et al. Divergence Analysis and Optimizations , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[13] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[14] Scott A. Mahlke,et al. IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors , 1998, 25 Years ISCA: Retrospectives and Reprints.
[15] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[16] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[17] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.
[18] Michael D. Smith,et al. Boosting beyond static scheduling in a superscalar processor , 1990, ISCA '90.
[19] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[20] Tianyi David Han,et al. Reducing branch divergence in GPU programs , 2011, GPGPU-4.
[21] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[22] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[23] Michael D. Smith,et al. Support for Speculative Execution in High-Performance Processors , 1992 .
[24] References , 1971 .
[25] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools (2nd Edition) , 2006 .
[26] James R. Larus,et al. Branch prediction for free , 1993, PLDI '93.