Static code transformations for thread‐dense memory accesses in GPU computing
Hwansoo Han | Hyun-Jun Kim | Sungin Hong | Jeonghwan Park
[1] Jaejin Lee,et al. Design and implementation of software-managed caches for multicores with local memory , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.
[2] Yi Yang,et al. Shared memory multiplexing: A novel way to improve GPGPU throughput , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[3] Mahmut T. Kandemir,et al. Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[4] David R. Kaeli,et al. Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures , 2011, IEEE Transactions on Parallel and Distributed Systems.
[5] Yooseong Kim,et al. CuMAPz: A tool to analyze memory access patterns in CUDA , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).
[6] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[7] Richard W. Vuduc,et al. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[8] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9] Mehrzad Samadi,et al. Memory-centric system interconnect design with hybrid memory cubes , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[10] Shuaiwen Song,et al. Locality-Driven Dynamic GPU Cache Bypassing , 2015, ICS.
[11] Wen-mei W. Hwu,et al. CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.
[12] John Kim,et al. Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[13] Hyesoon Kim,et al. Spare register aware prefetching for graph algorithms on GPUs , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[14] Rajeev Alur,et al. Block-Size Independence for GPU Programs , 2018, SAS.
[15] Carole-Jean Wu,et al. CAWS: Criticality-aware warp scheduling for GPGPU workloads , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[16] Jianfei Wang,et al. Incorporating selective victim cache into GPGPU for high-performance computing , 2017, Concurrency and Computation: Practice and Experience.
[17] Ari B. Hayes,et al. Unified on-chip memory allocation for SIMT architecture , 2014, ICS '14.
[18] Rajeev Alur,et al. GPUDrano: Detecting Uncoalesced Accesses in GPU Programs , 2017, CAV.
[19] Donald S. Fussell,et al. Priority-based cache allocation in throughput processors , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[20] Henk Corporaal,et al. Adaptive and transparent cache bypassing for GPUs , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[21] Nam Sung Kim,et al. CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[22] Mike O'Connor,et al. Divergence-Aware Warp Scheduling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[23] Allen D. Malony,et al. Autotuning GPU Kernels via Static and Predictive Analysis , 2017, 2017 46th International Conference on Parallel Processing (ICPP).
[24] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[25] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[26] Terence Parr,et al. LL(*): the foundation of the ANTLR parser generator , 2011, PLDI '11.
[27] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).
[28] Yutao Zhong,et al. Predicting whole-program locality through reuse distance analysis , 2003, PLDI.
[29] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[30] Gagan Agrawal,et al. Optimizing MapReduce for GPUs with effective shared memory usage , 2012, HPDC '12.
[31] P. Sadayappan,et al. Characterizing and enhancing global memory data coalescing on GPUs , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[32] Xinxin Mei,et al. Dissecting GPU Memory Hierarchy Through Microbenchmarking , 2015, IEEE Transactions on Parallel and Distributed Systems.
[33] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[34] Kenli Li,et al. A Hybrid Parallel Solving Algorithm on GPU for Quasi-Tridiagonal System of Linear Equations , 2016, IEEE Transactions on Parallel and Distributed Systems.
[35] Jason Cong,et al. A reuse-aware prefetching scheme for scratchpad memory , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).
[36] William J. Dally,et al. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[37] Henk Corporaal,et al. A detailed GPU cache model based on reuse distance theory , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[38] Ranjani Parthasarathi,et al. Exploiting GPU memory hierarchy for accelerating a specialized stencil computation , 2017, Concurrency and Computation: Practice and Experience.
[39] Feng Ji,et al. Using Shared Memory to Accelerate MapReduce on Graphics Processing Units , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[40] Reena Panda,et al. Statistical pattern based modeling of GPU memory access streams , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[41] Scott A. Mahlke,et al. APOGEE: Adaptive prefetching on GPUs for energy efficiency , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[42] Zhen Lin,et al. Automatic data placement into GPU on-chip memory resources , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[43] Yi Yang,et al. A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.