Supporting input-dependent access pattern algorithms on GPUs using GPUfs

Accelerating the processing of very large datasets on GPUs is challenging, particularly when algorithms exhibit unpredictable data access patterns. In this paper we investigate the utility of GPUfs, a library that provides direct access to files from GPU programs, for implementing such algorithms. We analyze the system’s bottlenecks and suggest several modifications to the GPUfs design, including a new concurrent hash table for the buffer cache and a highly parallel memory allocator. We evaluate our changes by implementing a real image processing application that creates collages from a dataset of 2 million images. The enhanced GPUfs design improves application performance by 2× over the original GPUfs and outperforms both a 12-core parallel CPU implementation and a standard CUDA-based GPU implementation, while significantly simplifying GPU application design.