Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing that improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying loop interchange, but many conflict misses remain if the filter length exceeds the cache associativity. We propose two methods to reduce the number of conflict misses, which provide an additional performance improvement of up to a factor of 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4. We present MMX and SSE implementations of horizontal and vertical filtering that provide maximum speedups of 3.39 and 6.72, respectively.
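The loop interchange mentioned above can be illustrated with a minimal C sketch. It is not the paper's implementation: it uses a plain odd-length FIR filter without subsampling or lifting, a row-major `float` image, and symmetric boundary extension, and the names `mirror`, `vfilter_naive`, and `vfilter_interchanged` are hypothetical. The first routine walks down each column, so consecutive accesses are `cols` elements apart; the second makes the column loop innermost, so every access is unit-stride along a row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reflect an out-of-range row index back into [0, rows). */
static size_t mirror(long i, size_t rows)
{
    long n = (long)rows;
    if (n == 1)
        return 0;
    while (i < 0 || i >= n) {
        if (i < 0)
            i = -i;
        if (i >= n)
            i = 2 * (n - 1) - i;
    }
    return (size_t)i;
}

/* Straightforward vertical filtering: the outer loop runs over columns,
 * so successive reads of `in` are `cols` floats apart (poor locality). */
void vfilter_naive(const float *in, float *out, size_t rows, size_t cols,
                   const float *taps, int ntaps)
{
    assert(ntaps >= 1 && ntaps % 2 == 1);
    int half = ntaps / 2;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            float acc = 0.0f;
            for (int k = 0; k < ntaps; k++)
                acc += taps[k] * in[mirror((long)r + k - half, rows) * cols + c];
            out[r * cols + c] = acc;
        }
    }
}

/* Loop-interchanged version: the column loop is innermost, so each pass
 * streams through one input row and one output row with unit stride. */
void vfilter_interchanged(const float *in, float *out, size_t rows, size_t cols,
                          const float *taps, int ntaps)
{
    assert(ntaps >= 1 && ntaps % 2 == 1);
    int half = ntaps / 2;
    for (size_t r = 0; r < rows; r++) {
        float *dst = out + r * cols;
        for (size_t c = 0; c < cols; c++)
            dst[c] = 0.0f;
        for (int k = 0; k < ntaps; k++) {
            const float *src = in + mirror((long)r + k - half, rows) * cols;
            float t = taps[k];
            for (size_t c = 0; c < cols; c++)
                dst[c] += t * src[c];
        }
    }
}
```

Even in the interchanged form, each output row reads `ntaps` input rows. If the row pitch maps those rows to the same cache set, they can still evict one another once `ntaps` exceeds the associativity, which is the conflict-miss problem the abstract refers to.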
