An Evaluation of Unified Memory Technology on NVIDIA GPUs

Unified Memory is an emerging technology supported by CUDA 6.x. Before CUDA 6.x, the CUDA programming model relied on programmers to explicitly manage data transfers between the CPU and the GPU, which increases programming complexity. CUDA 6.x introduces a new technology called Unified Memory, a programming model that presents CPU and GPU memory as a single coherent memory space (conceptually, one common address space): the system manages data access between CPU and GPU without explicit memory copy calls. This paper evaluates the Unified Memory technology through different applications on different GPUs to show users how to use the Unified Memory technology of CUDA 6.x efficiently. The applications include the Diffusion3D benchmark, the Parboil benchmark suite, and matrix multiplication from the CUDA SDK samples. We ported these applications to corresponding Unified Memory versions and compared them with the originals. We selected the NVIDIA Kepler K40 and the Jetson TK1, which represent, respectively, the latest GPUs with the Kepler architecture and the first NVIDIA mobile platform with a Kepler GPU. This paper shows that the Unified Memory versions incur a 10% performance loss on average. Furthermore, we used the NVIDIA Visual Profiler to investigate the cause of the performance loss introduced by the Unified Memory technology.
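To make the two models concrete, the following is a minimal sketch (our illustration, not code from the paper; the kernel, array size, and launch configuration are hypothetical) contrasting the explicit-copy model with the Unified Memory model of CUDA 6.x:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *a, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Explicit-copy model (before CUDA 6.x): the programmer keeps separate
    // host and device allocations and moves data between them by hand.
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
    cudaFree(d);
    free(h);

    // Unified Memory model (CUDA 6.x): one managed pointer is valid on both
    // CPU and GPU, and the system migrates the data; no explicit cudaMemcpy.
    float *m;
    cudaMallocManaged(&m, bytes);
    for (int i = 0; i < n; ++i) m[i] = 1.0f;           // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(m, n, 2.0f);
    cudaDeviceSynchronize();  // on Kepler-class GPUs the CPU must not touch
                              // managed data until the kernel has finished
    printf("m[0] = %f\n", m[0]);                       // CPU reads the same pointer
    cudaFree(m);
    return 0;
}

Trading explicit cudaMemcpy calls for system-managed migration simplifies the code, but the data movement still happens under the hood; that hidden movement is the kind of behavior a tool such as the NVIDIA Visual Profiler can make visible when diagnosing overheads like those measured in this paper.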
