Improved Parallel Image Processing Algorithms by CUDA in GPU Environment

An integral histogram enables constant-time computation of the histogram of an image region. Mark Harris proposed a parallel prefix sum algorithm for CUDA GPGPU that can be used to initialize an integral histogram. Because CUDA restricts the number of threads in a block, Harris' algorithm divides a large image into multiple blocks. This division increases the number of global memory accesses and is a major cause of performance degradation. In this paper we propose an allocation scheme that maps multiple pixels to each thread when the integral histogram is initialized for a large image. The proposed allocation scheme fully utilizes shared memory and reduces the number of accesses to global memory. Experimental results show that the execution time of the proposed algorithm is 94.7% to 99.8% of that of Harris' algorithm.
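
The idea of mapping multiple pixels to one thread can be illustrated with a small CPU-side sketch. This is not the kernel described in the paper: it is a plain Python simulation under the assumption that one block of `num_threads` threads scans a whole row, with each simulated thread owning a contiguous chunk of pixels. The function name `scan_row_multi_pixel`, its parameters, and the three-phase layout are hypothetical and chosen only for illustration.

```python
def scan_row_multi_pixel(row, num_threads):
    """Inclusive prefix sum of one row, simulating a single thread block.

    Hypothetical sketch: each of the num_threads simulated threads owns a
    contiguous chunk of pixels, so the row does not have to be split
    across several blocks.
    """
    n = len(row)
    per_thread = -(-n // num_threads)  # ceiling division: pixels per thread

    # Stand-in for shared memory: the row is copied once from "global" memory.
    shared = list(row)
    totals = [0] * num_threads

    # Phase 1: every thread serially scans its own chunk.
    for t in range(num_threads):
        start = t * per_thread
        end = min(start + per_thread, n)
        running = 0
        for i in range(start, end):
            running += shared[i]
            shared[i] = running
        totals[t] = running

    # Phase 2: exclusive scan over the per-thread totals.
    # (On a GPU this step would itself be a parallel scan within the block.)
    offsets = [0] * num_threads
    acc = 0
    for t in range(num_threads):
        offsets[t] = acc
        acc += totals[t]

    # Phase 3: every thread adds its offset to its own chunk.
    for t in range(num_threads):
        start = t * per_thread
        end = min(start + per_thread, n)
        for i in range(start, end):
            shared[i] += offsets[t]

    return shared


# Seven pixels handled by three simulated threads.
result = scan_row_multi_pixel([1, 2, 3, 4, 5, 6, 7], 3)
```

For an integral histogram, such a scan would be applied per bin to a 0/1 indicator row (1 where the pixel falls in the bin), first along rows and then along columns. In this sketch, intermediate values stay in the `shared` list between phases, which mirrors the goal of keeping intermediate results in shared memory rather than writing them back to global memory.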