High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs

Graphics Processing Units (GPUs) are well suited to highly data-parallel workloads such as image processing because of their massive parallel processing power. Many image processing applications use the histogramming algorithm, which fills a set of bins according to the frequency of occurrence of pixel values in an input image. Histogramming has been mapped to GPUs before, and although significant research effort has gone into optimizing that mapping, we show that both the performance and the performance predictability of existing methods can still be improved. In this paper, we present two novel histogramming methods, both of which achieve higher performance and better predictability than existing methods. We explore the algorithm trade-offs of both novel methods to identify their performance limitations, and we evaluate the performance of the novel and the existing methods. The first novel method gives an average performance increase of 33% over existing methods on non-synthetic benchmarks. The second novel method gives an average performance increase of 56% over existing methods and is guaranteed to be fully data independent. The second method is designed specifically for newer GPU architectures, whereas the first method is also suitable for older architectures.
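To make the histogramming operation itself concrete, the following is a minimal Python sketch; it is not either of the paper's GPU methods. The function names, the `num_workers` parameter, and the round-robin assignment of pixels to workers are assumptions made for illustration. The second function mimics a commonly used strategy, private sub-histograms merged by a final reduction, in sequential form.

```python
def histogram(pixels, num_bins=256):
    """Count how often each pixel value occurs (one bin per value)."""
    bins = [0] * num_bins
    for value in pixels:
        bins[value] += 1
    return bins


def histogram_privatized(pixels, num_bins=256, num_workers=4):
    """Illustrative only: each hypothetical worker fills a private
    sub-histogram, and the sub-histograms are summed at the end.
    On a parallel machine the private copies avoid contention on
    shared bins; here the workers are simulated sequentially."""
    sub_histograms = [[0] * num_bins for _ in range(num_workers)]
    for index, value in enumerate(pixels):
        # Round-robin assignment of pixels to workers (an assumption).
        sub_histograms[index % num_workers][value] += 1
    # Reduction step: add up the per-worker counts for every bin.
    return [sum(sub[b] for sub in sub_histograms) for b in range(num_bins)]
```

Both functions return the same bin counts for any input. In a parallel setting, the cost of the naive form typically depends on how often different workers hit the same bin, which is one reason the performance of histogramming can depend on the input data.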
