Warp-Consolidation: A Novel Execution Model for GPUs
Shuaiwen Song | Weifeng Liu | Ang Li | Kevin J. Barker | Linnan Wang