Improving the memory behavior of vertical filtering in the discrete wavelet transform

The discrete wavelet transform (DWT) is used in several image and video compression standards, most notably JPEG2000. A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well known that a straightforward implementation of vertical filtering (assuming a row-major layout) incurs many cache misses because it lacks spatial locality. This can be avoided by interchanging the loops. This paper shows, however, that the resulting implementation suffers significantly from 64K aliasing, which occurs on the Pentium 4 when two accessed data blocks lie a multiple of 64K apart, and it proposes two techniques to avoid it. In addition, if the filter length exceeds four, the number of ways of the Pentium 4's L1 data cache is insufficient to avoid conflict misses. We therefore also propose two methods for reducing conflict misses. Although the experimental results were collected on the Pentium 4, the techniques are general and can be applied to processors with other cache organizations as well. Compared to already optimized baseline implementations, the proposed techniques improve the performance of vertical filtering by a factor of 3.11 for the (5,3) lifting scheme, 3.11 for Daubechies' four-coefficient transform, and 1.99 for the Cohen, Daubechies, and Feauveau 9/7 transform.
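
The loop interchange mentioned above can be illustrated with a minimal C sketch. It is not the paper's implementation: the function names, the generic 4-tap convolution, and the omission of boundary handling and subsampling are all assumptions made for illustration. Both functions compute the same result on a row-major image; only the order of the two outer loops differs.

```c
#include <stddef.h>

/* Hypothetical filter length, chosen for illustration only. */
enum { TAPS = 4 };

/* Column-by-column traversal: for a fixed column x, successive accesses
   are `width` floats apart, so each one touches a different cache line. */
void vfilter_column_order(const float *in, float *out,
                          size_t height, size_t width,
                          const float coef[TAPS])
{
    size_t out_rows = height - TAPS + 1;   /* assumes height >= TAPS */
    for (size_t x = 0; x < width; ++x) {
        for (size_t y = 0; y < out_rows; ++y) {
            float acc = 0.0f;
            for (size_t k = 0; k < TAPS; ++k)
                acc += coef[k] * in[(y + k) * width + x];
            out[y * width + x] = acc;
        }
    }
}

/* Interchanged loops: for a fixed output row y, x advances through
   adjacent elements of TAPS input rows, restoring spatial locality. */
void vfilter_row_order(const float *in, float *out,
                       size_t height, size_t width,
                       const float coef[TAPS])
{
    size_t out_rows = height - TAPS + 1;   /* assumes height >= TAPS */
    for (size_t y = 0; y < out_rows; ++y) {
        for (size_t x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (size_t k = 0; k < TAPS; ++k)
                acc += coef[k] * in[(y + k) * width + x];
            out[y * width + x] = acc;
        }
    }
}
```

In the interchanged version the inner loop reads TAPS input rows and writes one output row concurrently. This is the access pattern in which, according to the abstract, rows lying a multiple of 64K apart alias on the Pentium 4, and in which filters longer than four compete for the ways of the L1 data cache.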
