An investigation of Unified Memory Access performance in CUDA

Managing memory between the CPU and GPU is a major challenge in GPU computing. NVIDIA recently introduced a programming model, Unified Memory Access (UMA), to hide the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and its programming-model simplifications based on our experimental results. We find that, beyond on-demand data transfers to the CPU, the GPU can also request just the subsets of data it requires on demand. This feature allows UMA to outperform full-data-transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations.
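To make the programming-model difference concrete, the sketch below contrasts the explicit-copy model (cudaMalloc plus cudaMemcpy) with a managed allocation via cudaMallocManaged, where the CUDA runtime migrates data between host and device on demand. This is a minimal illustrative sketch, not code from our benchmarks; the kernel, names, and sizes are our own assumptions.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel: scale each element in place.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Explicit model: separate host and device buffers, manual full copies.
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        float *d;
        cudaMalloc((void **)&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // full transfer up front
        scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // full transfer back
        cudaFree(d);
        free(h);

        // UMA model: one managed pointer; the runtime handles migration.
        float *m;
        cudaMallocManaged((void **)&m, bytes);
        for (int i = 0; i < n; ++i) m[i] = 1.0f;       // touched on the CPU
        scale<<<(n + 255) / 256, 256>>>(m, n, 2.0f);   // runtime moves data to the GPU
        cudaDeviceSynchronize();                        // required before the CPU reads managed data
        printf("m[0] = %f\n", m[0]);
        cudaFree(m);
        return 0;
    }

Note that the managed version needs no cudaMemcpy calls at all, which is the simplification UMA offers; the trade-off we evaluate is that the programmer also gives up explicit control over when and how much data moves.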
