Is Data Placement Optimization Still Relevant on Newer GPUs?

Modern supercomputers often use Graphics Processing Units (GPUs) to meet the ever-growing demand for energy-efficient high-performance computing. GPUs have a complex memory architecture with several types of memories and caches, in particular global memory, shared memory, constant memory, and texture memory. Data placement optimization, i.e., optimizing the placement of data among these different memories, has a significant impact on the performance of HPC applications running on early generations of GPUs. However, newer generations of GPUs implement the same high-level memory hierarchy differently and add new memory features. In this paper, we design a set of experiments to explore the relevance of data placement optimization on several generations of NVIDIA GPUs, including Kepler, Maxwell, Pascal, and Volta. Our experiments include a set of memory microbenchmarks, CUDA kernels, and a proxy application, configured with different CUDA thread block sizes, data input sizes, and data placement choices. The results show that newer generations of GPUs are less sensitive to data placement optimization than older ones, mostly due to improvements in global memory caches.
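To make the notion of a data placement choice concrete, the sketch below is a minimal illustration of our own (not code from the paper; all kernel and variable names are hypothetical). It implements the same element-wise scaling twice: once with a small coefficient table read from global memory, and once with the table placed in `__constant__` memory, where the constant cache can broadcast a value to a whole warp. On early GPU generations with weaker global-memory caching, the constant-memory placement tends to win for this broadcast access pattern; on newer generations the gap narrows, which is the kind of effect the experiments above measure.

```cuda
// Minimal data-placement sketch (hypothetical names, not the paper's benchmark):
// the same kernel with its coefficient table in global vs. constant memory.
#include <cstdio>
#include <cuda_runtime.h>

#define N_COEF 16

__constant__ float c_coef[N_COEF];  // constant-memory copy of the table

// Placement A: coefficients fetched from global memory.
// All threads in a block read the same entry, so the access is a broadcast.
__global__ void scale_global(const float* in, float* out,
                             const float* g_coef, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * g_coef[blockIdx.x % N_COEF];
}

// Placement B: the same broadcast served by the constant cache.
__global__ void scale_constant(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * c_coef[blockIdx.x % N_COEF];
}

int main() {
    const int n = 1 << 20;
    float h_coef[N_COEF];
    for (int k = 0; k < N_COEF; ++k) h_coef[k] = 1.0f + 0.1f * k;

    float *d_in, *d_out, *d_coef;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_coef, sizeof(h_coef));
    cudaMemset(d_in, 0, n * sizeof(float));
    cudaMemcpy(d_coef, h_coef, sizeof(h_coef), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_coef, h_coef, sizeof(h_coef));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    scale_global<<<grid, block>>>(d_in, d_out, d_coef, n);
    scale_constant<<<grid, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_coef);
    return 0;
}
```

Timing the two kernels (e.g., with `cudaEvent_t` pairs) on each GPU generation listed above would reproduce, in miniature, the comparison the memory microbenchmarks perform.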
