Parallel 3D fast wavelet transform on manycore GPUs and multicore CPUs

Abstract: GPUs have recently attracted our attention as accelerators for a wide variety of algorithms, including assorted examples within the image analysis field. Among them, wavelets are gaining popularity as solid tools for data mining and video compression, though this comes at a high computational cost. After proving the effectiveness of the GPU for accelerating the 2D Fast Wavelet Transform [1], we present in this paper a novel implementation on manycore GPUs and multicore CPUs for high-performance computation of the 3D Fast Wavelet Transform (3D-FWT). This algorithm poses a challenging access pattern on matrix operators that demands high sustained bandwidth, as well as mathematical functions with remarkable arithmetic intensity on ALUs and FPUs. On the GPU side, we focus on CUDA programming to develop methods for an efficient mapping onto manycores and to fully exploit the memory hierarchy, which the programmer must manage explicitly. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and well-known techniques like tiling and blocking are exploited to optimize the use of memory. Experimental results on an Nvidia Tesla C870 GPU and an Intel Core 2 Quad Q6700 CPU indicate that our implementation runs three times faster on the Tesla than on the CPU, and up to fifteen times faster when communications are neglected, which makes the GPU suitable for real-time video processing in many applications where the 3D-FWT is involved.

[1] Stéphane Mallat, et al. A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, 1989, IEEE Trans. Pattern Anal. Mach. Intell.

[2] Andreas Uhl, et al. Cache issues with JPEG2000 wavelet lifting, 2002, IS&T/SPIE Electronic Imaging.

[3] Stamatis Vassiliadis, et al. Optimizing Cache Performance of the Discrete Wavelet Transform Using a Visualization Tool, 2007, ISM 2007.

[4] Manuel Ujaldon, et al. The 2D wavelet transform on emerging architectures: GPUs and multicores, 2011, Journal of Real-Time Image Processing.

[5] Stamatis Vassiliadis, et al. Improving the memory behavior of vertical filtering in the discrete wavelet transform, 2006, CF '06.

[6] G. G. Stokes. "J.", 1890, The New Yale Book of Quotations.

[7] I. Daubechies. Ten Lectures on Wavelets, 1992.

[8] Francisco Tirado, et al. Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting, 2008, IEEE Transactions on Parallel and Distributed Systems.

[9] J.M. Garcia, et al. A new lossy 3-D wavelet transform for high-quality compression of medical video, 2000, Proceedings 2000 IEEE EMBS International Conference on Information Technology Applications in Biomedicine. ITAB-ITIS 2000. Joint Meeting Third IEEE EMBS International Conference on Information Technol.

[10] Andrew Chi-Sing Leung, et al. Discrete Wavelet Transform on Consumer-Level Graphics Hardware, 2007, IEEE Transactions on Multimedia.

[11] Jens H. Krüger, et al. A Survey of General-Purpose Computation on Graphics Hardware, 2007, Eurographics.