Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards

Over the last decade, research and development of massively parallel commodity graphics hardware has grown dramatically in both academia and industry. Graphics card architectures provide a well-suited platform for the parallel execution of many number-crunching loop programs from fields such as image processing and linear algebra. However, mapping such algorithms efficiently to graphics hardware is hard, even with detailed insight into the architecture. This paper presents a multiresolution image processing algorithm and shows how this type of algorithm can be mapped efficiently to graphics hardware, along with double buffering concepts that hide memory transfers. Furthermore, the impact of the execution configuration is illustrated, and a method is proposed to determine the best configuration offline. Using CUDA as the programming model, it is demonstrated that the image processing algorithm is significantly accelerated and that a speedup of more than \(145\times\) can be achieved on NVIDIA’s Tesla C1060 compared to a parallelized implementation on a Xeon Quad Core. For deployment in a streaming application with continuously arriving data, it is shown that double buffering reduces the memory transfer overhead to the graphics card by a factor of six.
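The double buffering idea mentioned above can be sketched in CUDA roughly as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the kernel `scale`, the function `run_double_buffered`, the frame size, and the synthetic frame contents are all hypothetical placeholders. Two buffer sets, each bound to its own stream, alternate between frames, so that the host can prepare and enqueue one frame while the device is still copying or processing the previous one.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s (line %d)\n",              \
                    cudaGetErrorString(err_), __LINE__);               \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical stand-in for the real image filter: scales every pixel.
__global__ void scale(const float* in, float* out, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = gain * in[i];
}

// Streams `frames` synthetic frames through two alternating buffer sets.
// Returns the number of frames whose checked output pixel was wrong.
int run_double_buffered(int frames) {
    const int n = 1 << 20;                 // pixels per frame (assumed)
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *h_in[2], *h_out[2], *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        // Pinned host memory is required for truly asynchronous copies.
        CHECK(cudaMallocHost(&h_in[b], bytes));
        CHECK(cudaMallocHost(&h_out[b], bytes));
        CHECK(cudaMalloc(&d_in[b], bytes));
        CHECK(cudaMalloc(&d_out[b], bytes));
        CHECK(cudaStreamCreate(&stream[b]));
    }

    int mismatches = 0;
    // Two extra iterations drain the results of the last two frames.
    for (int f = 0; f < frames + 2; ++f) {
        const int b = f % 2;
        // Frame f-2 used this buffer set; wait until it has finished.
        CHECK(cudaStreamSynchronize(stream[b]));
        if (f >= 2 && h_out[b][n - 1] != 2.0f * (f - 2)) ++mismatches;
        if (f >= frames) continue;

        // "Acquire" frame f on the host while the other stream is busy.
        for (int i = 0; i < n; ++i) h_in[b][i] = (float)f;

        // Upload, process, and download are queued in order on stream b.
        CHECK(cudaMemcpyAsync(d_in[b], h_in[b], bytes,
                              cudaMemcpyHostToDevice, stream[b]));
        scale<<<blocks, threads, 0, stream[b]>>>(d_in[b], d_out[b], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out[b], d_out[b], bytes,
                              cudaMemcpyDeviceToHost, stream[b]));
    }

    for (int b = 0; b < 2; ++b) {
        CHECK(cudaStreamDestroy(stream[b]));
        CHECK(cudaFree(d_in[b]));
        CHECK(cudaFree(d_out[b]));
        CHECK(cudaFreeHost(h_in[b]));
        CHECK(cudaFreeHost(h_out[b]));
    }
    return mismatches;
}

int main() {
    int bad = run_double_buffered(8);
    printf("frames with wrong output: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

How much of the transfer time is actually hidden depends on the device's ability to overlap copies with kernel execution and on the ratio of transfer time to compute time, so the benefit of such a scheme should be measured rather than assumed.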
