Implementation of Non Local Means Filter in GPUs

In this paper, we review alternatives for reducing the computational complexity of the Non-Local Means image filter and present a CUDA-based GPU implementation, comparing its performance across different GPUs and against reference CPU implementations. Starting from a naive CUDA implementation, we describe the aspects of CUDA and of the algorithm itself that can be leveraged to decrease execution time. Our GPU implementation achieved speedups of up to 35.8x over our reduced-complexity reference CPU implementation, and more than 700x over a plain CPU implementation.
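To make the cost of a plain Non-Local Means filter concrete, the following is a minimal sketch of the textbook algorithm in pure Python. It is not the reference CPU or GPU code evaluated here. The function name `nlm_naive`, its parameters (`patch_radius`, `search_radius`, `h`), the clamped border handling, and the mean-squared patch distance are all assumptions chosen for illustration.

```python
import math


def nlm_naive(img, patch_radius=1, search_radius=2, h=0.1):
    """Plain, unoptimized Non-Local Means on a 2D list of floats (illustrative only)."""
    rows, cols = len(img), len(img[0])

    def px(r, c):
        # Assumed border policy: clamp coordinates to the image.
        return img[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    def patch_dist(r0, c0, r1, c1):
        # Mean squared difference between the patches centred on the two pixels.
        total = 0.0
        for dr in range(-patch_radius, patch_radius + 1):
            for dc in range(-patch_radius, patch_radius + 1):
                d = px(r0 + dr, c0 + dc) - px(r1 + dr, c1 + dc)
                total += d * d
        return total / (2 * patch_radius + 1) ** 2

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Each output pixel is computed independently of all the others.
            weight_sum = 0.0
            acc = 0.0
            for sr in range(max(r - search_radius, 0), min(r + search_radius, rows - 1) + 1):
                for sc in range(max(c - search_radius, 0), min(c + search_radius, cols - 1) + 1):
                    w = math.exp(-patch_dist(r, c, sr, sc) / (h * h))
                    weight_sum += w
                    acc += w * img[sr][sc]
            # weight_sum >= 1 because the pixel always matches itself with weight 1.
            out[r][c] = acc / weight_sum
    return out
```

In this sketch every output pixel costs on the order of (2 * search_radius + 1)^2 * (2 * patch_radius + 1)^2 operations, which is the complexity that reduced-complexity variants aim to cut. Because no output pixel depends on another, a one-thread-per-pixel mapping is a natural starting point for a naive GPU kernel.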
