Optimal Data Distribution for Versatile Finite Impulse Response Filtering on Next-Generation Graphics Hardware Using CUDA

In this paper, we investigate discrete finite impulse response (FIR) filtering of images, while harnessing the powerful computational resources of next-generation GPUs. These novel platforms exhibit a massive data parallel architecture with an advanced SIMT execution model and thread management, to enable designers to better cope with the infamous memory wall, i.e. the growing gap between the cost of data communication and computational processing. However, the concerning platforms still have hard constraints that prevent trivial optimization of convolution filtering. Although automatic (compiler) optimization is available, we investigate and explain the speedup potential considering manual intervention, given the context of FIR kernels. Furthermore, we present multiple convolution implementation techniques that are able to cope with the hard platform constraints in different situations, while still being able to optimize the implementation to the underlying architecture. Utilizing the acquired insights, a view is given on the impact for possible optimization when loosening these hard constraints in the near future.

[1]  Viktor Öwall,et al.  FPGA implementation of real-time image convolutions with three level of memory hierarchy , 2003, Proceedings. 2003 IEEE International Conference on Field-Programmable Technology (FPT) (IEEE Cat. No.03EX798).

[2]  Ruigang Yang,et al.  A Performance Study on Different Cost Aggregation Approaches Used in Real-Time Stereo Matching , 2007, International Journal of Computer Vision.

[3]  Tzi-cker Chiueh,et al.  An Implementation of a FIR Filter on a GPU , 2005 .

[4]  Victor Podlozhnyuk,et al.  Image Convolution with CUDA , 2007 .

[5]  Mounir Hamdi,et al.  Parallel Image Processing Applications on a Network of Workstations , 1995, Parallel Comput..

[6]  Jan-Michael Frahm,et al.  Real-Time Plane-Sweeping Stereo with Multiple Sweeping Directions , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Gauthier Lafruit,et al.  Real-time stereo-based view synthesis algorithms: A unified framework and evaluation on commodity GPUs , 2009, Signal Process. Image Commun..

[8]  Ramani Duraiswami,et al.  Canny edge detection on NVIDIA CUDA , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[9]  Lucas J. van Vliet,et al.  Edge preserving orientation adaptive filtering , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[10]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[11]  Saeid Belkasim,et al.  Accelerated 2D Image Processing on GPUs , 2005, International Conference on Computational Science.