Multithreaded double queuing for balanced CPU-GPU memory copying

Memory transfers between the CPU host and the GPU device are known to impose heavy overhead on GPU applications. For applications with large data inputs, memory transfers frequently take much longer than kernel execution. For unpinned host pages, the current transfer mechanism in the CUDA library fails to reach the maximum achievable transfer rate because the memory bandwidth discrepancy between the CPU and the GPU is not properly taken into account. In this work, we propose a multithreaded memory copy technique that uses double queuing to fully utilize the PCIe bandwidth during data transfers between the host CPU and the GPU device. Our technique doubles the transfer rate of the current CUDA cudaMemcpy() implementation by allowing multiple devices with different bandwidths to operate in a balanced manner.
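To illustrate the general double-queuing idea described above, the sketch below models a staging pipeline in portable C++ threads rather than CUDA: a pool of fixed-size staging buffers circulates between a "free" queue and a "ready" queue, so a copy thread can fill buffers from the (unpinned) source while a transfer thread drains them to the destination, standing in for the DMA engine. All names, sizes, and the two-thread structure are illustrative assumptions, not the paper's actual implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdio>
#include <cstring>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

constexpr size_t kChunk   = 4096;  // staging buffer size (illustrative)
constexpr int    kBuffers = 4;     // staging buffers in flight (illustrative)

struct Chunk {
    std::vector<char> data;
    size_t len = 0;     // valid bytes in this chunk
    size_t offset = 0;  // destination offset
};

// Minimal thread-safe queue with blocking pop.
class ChunkQueue {
    std::queue<Chunk*> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Chunk* c) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(c); }
        cv_.notify_one();
    }
    Chunk* pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Chunk* c = q_.front();
        q_.pop();
        return c;
    }
};

// Copy src[0..n) to dst through staged chunks using two queues.
void double_queue_copy(const char* src, char* dst, size_t n) {
    std::vector<Chunk> pool(kBuffers);
    ChunkQueue freeQ, readyQ;
    for (auto& c : pool) { c.data.resize(kChunk); freeQ.push(&c); }

    // Producer: stage source data into free buffers (host-side memcpy,
    // analogous to copying unpinned pages into pinned staging memory).
    std::thread producer([&] {
        for (size_t off = 0; off < n; off += kChunk) {
            Chunk* c = freeQ.pop();
            c->len = std::min(kChunk, n - off);
            c->offset = off;
            std::memcpy(c->data.data(), src + off, c->len);
            readyQ.push(c);
        }
        readyQ.push(nullptr);  // end-of-stream marker
    });

    // Consumer: drain ready buffers to the destination (stand-in for the
    // PCIe DMA transfer), then recycle each staging buffer.
    std::thread consumer([&] {
        while (Chunk* c = readyQ.pop()) {
            std::memcpy(dst + c->offset, c->data.data(), c->len);
            freeQ.push(c);
        }
    });

    producer.join();
    consumer.join();
}
```

Because the two queues decouple the stages, the host-side copy and the (simulated) device transfer proceed concurrently; the relative number of producer and consumer threads could then be tuned to balance devices with different bandwidths.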