Accelerated implementation of adaptive directional lifting-based discrete wavelet transform on GPU

Because of the high data dependency between ADL (adaptive directional lifting) operations such as interpolation, directional prediction, and update, existing CUDA-specific (Compute Unified Device Architecture) implementations of the traditional rectilinear lifting-based transform are difficult to apply to the ADL-based transform. This paper proposes a novel CUDA-specific method, named Slice, for implementing ADL-based wavelet transforms on the GPU (Graphics Processing Unit). Compared with previous CUDA-specific methods, the proposed method has each step handled by a different kernel to avoid unnecessary waiting time between lifting steps. Meanwhile, interpolation and decomposition, including prediction and update, are executed in an interleaved fashion for each filtered pixel. Moreover, coalesced memory accesses are exploited to the greatest extent by reading a slice of data into shared memory in a coalesced manner and, after processing, writing it back to global memory in the same way. The results show that the Slice method overcomes the limitation of high data dependency between the lifting steps and achieves a speedup of more than 10 times over the optimized CPU implementation of the ADL-based transform.
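
To make the data dependency between prediction and update concrete, the following is a minimal CPU reference sketch of one vertical directional lifting level. It is not the Slice method and not CUDA code. The function names, the 5/3-style weights (1/2 for prediction, 1/4 for update), the single integer direction `d` for the whole image, and the clamped borders are all assumptions made for illustration; in particular, the fractional-pixel interpolation mentioned above is omitted.

```python
# Illustrative CPU sketch of one directional lifting level (vertical split).
# Assumptions (not from the paper): integer direction d shared by all pixels,
# 5/3-style weights, clamped borders, image with at least two rows.

def _clamp(i, n):
    """Clamp index i into the valid range [0, n - 1]."""
    return max(0, min(n - 1, i))

def _predict(even, odd, d, sign):
    # Each odd-row pixel is estimated from the even rows above and below,
    # sampled d columns to the left (above) and right (below).
    ne, w = len(even), len(even[0])
    for k in range(len(odd)):
        up, down = even[k], even[_clamp(k + 1, ne)]
        for x in range(w):
            odd[k][x] += sign * 0.5 * (up[_clamp(x - d, w)] + down[_clamp(x + d, w)])

def _update(even, odd, d, sign):
    # Each even-row pixel is adjusted with the high-pass rows above and below
    # along the same direction. This step reads the output of _predict for
    # neighbouring rows, which is the dependency between lifting steps.
    no, w = len(odd), len(odd[0])
    for k in range(len(even)):
        hi_up, hi_down = odd[_clamp(k - 1, no)], odd[_clamp(k, no)]
        for x in range(w):
            even[k][x] += sign * 0.25 * (hi_up[_clamp(x - d, w)] + hi_down[_clamp(x + d, w)])

def adl_forward(img, d):
    """Return (low-pass rows, high-pass rows) for direction d."""
    even = [[float(v) for v in row] for row in img[0::2]]
    odd = [[float(v) for v in row] for row in img[1::2]]
    _predict(even, odd, d, -1.0)
    _update(even, odd, d, +1.0)
    return even, odd

def adl_inverse(even, odd, d):
    """Undo adl_forward by reversing the lifting steps and re-interleaving rows."""
    even = [row[:] for row in even]
    odd = [row[:] for row in odd]
    _update(even, odd, d, -1.0)
    _predict(even, odd, d, +1.0)
    out = []
    for k in range(len(even)):
        out.append(even[k])
        if k < len(odd):
            out.append(odd[k])
    return out
```

In this sketch the update of a low-pass row cannot begin until the high-pass rows on both sides of it have been predicted, so a parallel implementation has to order or synchronize the two steps. Calling `adl_inverse(*adl_forward(img, d), d)` on an integer-valued image returns the original pixel values, because each lifting step is undone by subtracting exactly what was added.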
