Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study

Programs developed under the Compute Unified Device Architecture (CUDA) achieve their highest performance when the exploitation of hardware resources on a Graphics Processing Unit (GPU) is maximized. To this end, load balancing among threads and high processor occupancy, i.e. the ratio of active threads to the maximum number the hardware can host, are indispensable. However, in certain applications, an optimally balanced implementation may limit occupancy because it needs more registers and shared memory. This is the case for the Fast Generalized Hough Transform (Fast GHT), an image-processing technique for localizing an object within an image. In this work, we present two parallelization alternatives for the Fast GHT: one that optimizes load balancing and another that maximizes occupancy. We have compared them on a large number of real images to identify their strengths and weaknesses, and we have drawn several conclusions about the conditions under which each is preferable. We have also tackled several parallelization problems related to sparse data distribution, divergent execution paths, and irregular memory access patterns in update operations by proposing a set of generic techniques, including compacting, sorting, and memory storage replication. Finally, we have compared our Fast GHT with the classic GHT, both running on a current GPU, obtaining a significant speed-up.
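The trade-off between resource usage and occupancy described above can be illustrated with a rough estimate. The following is a minimal sketch, not the method used in this work: the function name `estimate_occupancy` and the per-multiprocessor limits (1536 threads, 8 blocks, 32768 registers, 48 KB of shared memory) are illustrative assumptions, and hardware allocation granularity is ignored.

```python
def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       max_threads_per_sm=1536, max_blocks_per_sm=8,
                       regs_per_sm=32768, smem_per_sm=49152):
    """Rough occupancy estimate for one multiprocessor.

    All hardware limits are assumed, illustrative values; real devices
    also round allocations up to a granularity that is ignored here.
    """
    # Each resource caps the number of blocks that can be resident at once.
    limits = [
        max_threads_per_sm // threads_per_block,                # thread slots
        max_blocks_per_sm,                                      # block slots
        regs_per_sm // (regs_per_thread * threads_per_block),   # registers
    ]
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)            # shared memory

    resident_blocks = min(limits)
    # Occupancy: resident threads divided by the maximum the hardware hosts.
    return resident_blocks * threads_per_block / max_threads_per_sm


# A lightweight kernel versus one that needs more registers and shared memory.
light = estimate_occupancy(256, regs_per_thread=20, smem_per_block=4096)
heavy = estimate_occupancy(256, regs_per_thread=40, smem_per_block=16384)
print(light, heavy)
```

Under these assumed limits, doubling the registers per thread and quadrupling the shared memory per block halves the estimated occupancy, which is the kind of cost a better-balanced but more resource-hungry kernel may incur.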
