Benchmarking the GPU memory at the warp level
Yuangang Wang | Haifang Zhou | Weimin Zhang | Jianbin Fang | Minquan Fang | Jianxing Liao
[1] Dong Li, et al. PORPLE: An Extensible Optimizer for Portable Data Placement on GPU, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[2] Xinxin Mei, et al. Benchmarking the Memory Hierarchy of Modern GPUs, 2014, NPC.
[3] David R. Kaeli, et al. Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures, 2011, IEEE Transactions on Parallel and Distributed Systems.
[4] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.
[5] Carl Staelin, et al. Memory hierarchy performance measurement of commercial dual-core desktop processors, 2008, J. Syst. Archit.
[6] Hyesoon Kim, et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, 2009, ISCA '09.
[7] Hao Chi, et al. Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs, 2014, BMC Bioinformatics.
[8] Ben H. H. Juurlink, et al. Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency, 2015, TACO.
[9] Xipeng Shen, et al. On-the-fly elimination of dynamic irregularities for GPU computing, 2011, ASPLOS XVI.
[10] 张卫民, et al. A parallel algorithm of FastICA dimensionality reduction for hyperspectral image on GPU, 2015.
[11] Andreas Moshovos, et al. Demystifying GPU microarchitecture through microbenchmarking, 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[12] Jianbin Fang, et al. Test-driving Intel Xeon Phi, 2014, ICPE.
[13] Henk Corporaal, et al. A detailed GPU cache model based on reuse distance theory, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[14] Yun Liang, et al. An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[15] Matthias S. Müller, et al. Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.
[16] Wen-mei W. Hwu, et al. Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors, 2012, PPoPP '12.
[17] Xiaowen Chu, et al. G-BLASTN: accelerating nucleotide alignment by graphics processors, 2014, Bioinform.
[18] William J. Dally, et al. GPUs and the Future of Parallel Computing, 2011, IEEE Micro.
[19] Yun Liang, et al. An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization, 2016, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
[20] Bo Wu, et al. Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU, 2013, PPoPP '13.
[21] Paulius Micikevicius, et al. 3D finite difference computation on GPUs using CUDA, 2009, GPGPU-2.
[22] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, HiPC 2008.
[23] Yi Yang, et al. A GPGPU compiler for memory optimization and parallelism management, 2010, PLDI '10.
[24] Lianru Gao, et al. Real-time implementation of optimized maximum noise fraction transform for feature extraction of hyperspectral images, 2014.
[25] Alan Jay Smith, et al. Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes, 1995, IEEE Trans. Computers.
[26] Nicholas Wilt, et al. The CUDA Handbook: A Comprehensive Guide to GPU Programming, 2013.
[27] Gagan Agrawal, et al. An integer programming framework for optimizing shared memory use on GPUs, 2010, 2010 International Conference on High Performance Computing.
[28] Xinxin Mei, et al. Dissecting GPU Memory Hierarchy Through Microbenchmarking, 2015, IEEE Transactions on Parallel and Distributed Systems.