Multithreaded double queuing for balanced CPU-GPU memory copying

Memory transfers between the CPU host and the GPU device are known to impose heavy overhead on GPU applications. For applications with large data inputs, memory transfers frequently take much longer than kernel execution. For unpinned host pages, the current transfer mechanism in the CUDA library fails to reach the maximum achievable transfer rate because the memory bandwidth discrepancy between the CPU and the GPU is not properly taken into account. In this work, we propose a multithreaded memory copy technique that uses double queuing to fully utilize the PCIe bandwidth during data transfers between the host CPU and the GPU device. Our technique doubles the transfer rate of the current CUDA cudaMemcpy() implementation by allowing multiple devices with different bandwidths to operate in a balanced manner.
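To illustrate the general double-queuing idea described above, the sketch below models a staging pipeline in portable C++ threads rather than CUDA: a pool of fixed-size staging buffers circulates between a "free" queue and a "ready" queue, so a copy thread can fill buffers from the (unpinned) source while a transfer thread drains them to the destination, standing in for the DMA engine. All names, sizes, and the two-thread structure are illustrative assumptions, not the paper's actual implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdio>
#include <cstring>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

constexpr size_t kChunk   = 4096;  // staging buffer size (illustrative)
constexpr int    kBuffers = 4;     // staging buffers in flight (illustrative)

struct Chunk {
    std::vector<char> data;
    size_t len = 0;     // valid bytes in this chunk
    size_t offset = 0;  // destination offset
};

// Minimal thread-safe queue with blocking pop.
class ChunkQueue {
    std::queue<Chunk*> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Chunk* c) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(c); }
        cv_.notify_one();
    }
    Chunk* pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Chunk* c = q_.front();
        q_.pop();
        return c;
    }
};

// Copy src[0..n) to dst through staged chunks using two queues.
void double_queue_copy(const char* src, char* dst, size_t n) {
    std::vector<Chunk> pool(kBuffers);
    ChunkQueue freeQ, readyQ;
    for (auto& c : pool) { c.data.resize(kChunk); freeQ.push(&c); }

    // Producer: stage source data into free buffers (host-side memcpy,
    // analogous to copying unpinned pages into pinned staging memory).
    std::thread producer([&] {
        for (size_t off = 0; off < n; off += kChunk) {
            Chunk* c = freeQ.pop();
            c->len = std::min(kChunk, n - off);
            c->offset = off;
            std::memcpy(c->data.data(), src + off, c->len);
            readyQ.push(c);
        }
        readyQ.push(nullptr);  // end-of-stream marker
    });

    // Consumer: drain ready buffers to the destination (stand-in for the
    // PCIe DMA transfer), then recycle each staging buffer.
    std::thread consumer([&] {
        while (Chunk* c = readyQ.pop()) {
            std::memcpy(dst + c->offset, c->data.data(), c->len);
            freeQ.push(c);
        }
    });

    producer.join();
    consumer.join();
}
```

Because the two queues decouple the stages, the host-side copy and the (simulated) device transfer proceed concurrently; the relative number of producer and consumer threads could then be tuned to balance devices with different bandwidths.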