Paper information - A GPGPU microarchitecture supports multi-path execution and branch compaction - 字舞流文

A GPGPU microarchitecture supports multi-path execution and branch compaction

Yuming Zhang | Chenglu Sun | Z. Tian | Shiwei Jia

[1] Yimen Zhang, et al. A Survey of GPGPU Parallel Processing Architecture Performance Optimization, 2021, 2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall).

[2] Tor M. Aamodt, et al. Analyzing Machine Learning Workloads Using a Detailed GPU Simulator, 2018, 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[3] Won Woo Ro, et al. Characterizing convolutional neural network workloads on a detailed GPU simulator, 2017, 2017 International SoC Design Conference (ISOCC).

[4] Murali Annavaram, et al. PATS: Pattern aware scheduling and power gating for GPGPUs, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[5] Mike O'Connor, et al. A scalable multi-path microarchitecture for efficient GPU control flow, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[6] Nam Sung Kim, et al. GPUWattch: enabling energy optimizations in GPGPUs, 2013, ISCA.

[7] Mattan Erez, et al. Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation, 2013, ISCA.

[8] Mattan Erez, et al. The dual-path execution model for efficient GPU control flow, 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[9] Mattan Erez, et al. CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures, 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[10] Onur Mutlu, et al. Improving GPU performance via large warps and two-level warp scheduling, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11] Tor M. Aamodt, et al. Thread block compaction for efficient SIMT control flow, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[12] Kevin Skadron, et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance, 2010, ISCA.

[13] Jung Ho Ahn, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[14] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[15] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[16] Tor M. Aamodt, et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).