Benchmarking the GPU memory at the warp level

Abstract: Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. GPUs, however, expose a hierarchical memory system, and optimizing code for it requires an in-depth understanding from programmers. To build this understanding, the capability (latency or bandwidth) of the memory system is typically measured with micro-benchmarks. Prior work measures the latency observed by a single thread to reveal undisclosed hardware details. Such per-thread measurements cannot reflect how a program actually executes, because the smallest schedulable unit of parallelism on a GPU is a warp of 32 threads. This motivates us to benchmark the GPU memory system at the warp level. In this paper, we benchmark the GPU memory system to quantify its capability for parallel accessing and broadcasting. These warp-level measurements cover shared memory, constant memory, global memory, and texture memory. Further, we discuss how to replace local memory with registers, how to avoid bank conflicts in shared memory, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we derive optimization guidelines for each type of memory and build an optimization framework for GPU memories. Through a case study on maximum noise fraction rotation for dimensionality reduction of hyperspectral images, we demonstrate that our framework is applicable and effective. Our work discloses the characteristics of GPU memories at the warp level and leads to optimization guidelines. The warp-level benchmarking results can facilitate designing parallel algorithms and modeling and optimizing GPU programs. To the best of our knowledge, this is the first warp-level benchmarking effort for the GPU memory system.
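To illustrate the kind of warp-level latency measurement the abstract describes, the sketch below shows one possible micro-benchmark structure; it is not the authors' benchmark code. A single warp pointer-chases through a ring stored in shared memory and times the dependent loads with clock64(). The array size, stride, and iteration count are illustrative assumptions.

```cuda
// Minimal warp-level shared-memory latency sketch (assumed parameters).
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS      1024   // dependent loads per thread (assumption)
#define SMEM_WORDS 1024   // size of the pointer-chase ring (assumption)

__global__ void warp_latency(const unsigned *init, unsigned *out,
                             long long *cycles)
{
    __shared__ unsigned s[SMEM_WORDS];

    // Stage the pointer-chase table into shared memory.
    for (int i = threadIdx.x; i < SMEM_WORDS; i += blockDim.x)
        s[i] = init[i];
    __syncthreads();

    unsigned j = threadIdx.x;          // each thread starts at its own slot
    long long start = clock64();
    #pragma unroll 1
    for (int i = 0; i < ITERS; ++i)
        j = s[j];                      // dependent load: serializes the chain
    long long stop = clock64();

    out[threadIdx.x] = j;              // keep the chain from being optimized away
    if (threadIdx.x == 0)
        *cycles = (stop - start) / ITERS;   // cycles per access for the warp
}

int main()
{
    unsigned h_init[SMEM_WORDS];
    for (int i = 0; i < SMEM_WORDS; ++i)
        h_init[i] = (i + 32) % SMEM_WORDS;  // stride-32 ring: one bank per thread

    unsigned *d_init, *d_out;
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_init, sizeof(h_init));
    cudaMalloc(&d_out, 32 * sizeof(unsigned));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_init, h_init, sizeof(h_init), cudaMemcpyHostToDevice);

    warp_latency<<<1, 32>>>(d_init, d_out, d_cycles);   // launch a single warp
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("approx. shared-memory latency: %lld cycles per access\n", h_cycles);
    return 0;
}
```

With a stride-32 ring, each of the 32 threads stays in its own shared-memory bank, so the reported cycle count approximates conflict-free warp-level latency; mapping several threads onto the same bank (for example, a stride of 1 with all threads starting at index 0 times a broadcast, while a stride of 2 starting at 2*threadIdx.x forces two-way conflicts) would expose the bank-conflict penalties the paper discusses.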
