A WAVELET-BASED VIDEO CODEC AND ITS PERFORMANCE

Wavelet-based video coding has received much attention and emerged as a viable alternative to the traditional DCT-based hybrid coding scheme. This paper presents a wavelet-based video codec, its fast implementation, and its compression efficiency. The codec consists of motion-compensated temporal filtering (MCTF), a 2-D spatial wavelet transform, and an extended SPIHT algorithm for wavelet coefficient coding. It exploits a new signal extension method and sub-sampling rule to improve the performance of the MCTF. Experimental results show that the codec performs as well as H.264 High Profile (reference software JM10.1 with I and P picture types) in terms of PSNR and bit rate. It generally provides better subjective picture quality than H.264 because it does not suffer from blocking artefacts.

INTRODUCTION

The exploding demand for video services, such as HDTV broadcasting, IPTV, mobile TV, and D-cinema, creates a sustained need for more efficient video coding in order to reduce storage and transmission bandwidth. All the existing video coding standards are based on the hybrid coding scheme of motion compensation (MC) and discrete cosine transform (DCT). The latest video coding standard, H.264/MPEG-4 AVC (1), is a mature and well-optimized version of this hybrid scheme. Its compression efficiency, up to 50% higher than that of MPEG-2, is achieved at the cost of greatly increased complexity. It is unclear how much more compression the hybrid scheme can offer to fulfil the requirements of future video services. Some experts think that the hybrid scheme has nearly reached its full potential, considering that it has been refined and optimized over the last two decades. Therefore, other solutions must be explored. Wavelet-based video coding has received much attention and emerged as a credible alternative to the traditional hybrid coding scheme (2)-(3).
Some experimental results have shown that wavelet-based video coding can provide compression efficiency similar to that of traditional hybrid coding (4)-(6). An advantage of wavelet-based video coding is that once a video sequence has been encoded at a given resolution and quality, a sequence with lower resolution or quality can easily be obtained by partially decoding the coded bit stream. This feature, called scalability, allows video delivery over heterogeneous networks to serve clients with various display and processing capabilities. In this paper, we first describe the structure and algorithms of a wavelet-based video codec called CRC-WVC. We then present its fast implementation using SIMD (Single Instruction, Multiple Data) instructions. Finally, we show the compression efficiency of the codec and explain why it can potentially outperform H.264.

STRUCTURE OF THE VIDEO CODEC

In wavelet-based video coding, a video sequence is divided into groups of pictures (GOPs) and each GOP is encoded independently. The motion within each GOP is estimated. A three-dimensional (3-D) wavelet transform is applied to the GOP to remove temporal and spatial redundancies. The resulting wavelet coefficients are then encoded using an entropy coding technique. Figure 1 shows the structure of our wavelet-based video codec, CRC-WVC, where MCTF stands for motion-compensated temporal filtering. The MCTF is in fact a one-dimensional (1-D) multi-level wavelet transform along motion trajectories. It removes the temporal redundancy within each GOP. The wavelet transform itself is a recursive filtering and sub-sampling process. The GOP to be transformed is first filtered along motion trajectories using a low-pass and a high-pass wavelet filter, respectively. The outputs of the filters are then sub-sampled by a factor of two, resulting in pictures of low-pass and high-pass wavelet coefficients, as shown in Figure 2.
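The filtering-and-subsampling step described above can be sketched in a few lines of Python. The sketch below computes one level of the 5/3 wavelet transform via lifting on a plain 1-D signal; it ignores motion compensation (the codec filters along motion trajectories) and uses simple symmetric extension rather than the extension method of reference (7), so it illustrates only the general mechanism, not the codec's exact filter bank.

```python
import numpy as np

def lifting_53_1d(x):
    """One level of the 5/3 wavelet transform via lifting.

    Splits an even-length signal x into low-pass (approximation) and
    high-pass (detail) halves, i.e. one filtering-plus-subsampling
    step. Boundary handling is simple symmetric extension, not the
    extension rule of reference (7).
    """
    x = np.asarray(x, dtype=np.int64)
    even, odd = x[0::2], x[1::2]
    # Predict step: each odd sample minus the average of its even
    # neighbours gives the high-pass (detail) coefficients.
    right = np.append(even[1:], even[-1])      # symmetric extension
    high = odd - ((even + right) >> 1)
    # Update step: correct the even samples with the detail signal
    # to obtain the low-pass (approximation) coefficients.
    left = np.insert(high[:-1], 0, high[0])    # symmetric extension
    low = even + ((left + high + 2) >> 2)
    return low, high
```

A multi-level transform, as used in the MCTF, is obtained by applying the same function recursively to the low-pass output.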
The filtering and sub-sampling process is then applied repeatedly to the resulting low-pass pictures to produce a multi-level transform. In CRC-WVC, the 5/3 wavelet filters are employed for the MCTF. A new signal extension method and a sub-sampling rule, presented in (7), are used to improve the performance of the MCTF. The motion required for the MCTF is estimated using a hierarchical variable-size block-matching algorithm (4). The resulting motion vectors are predicted with an efficient algorithm and the prediction errors are encoded with arithmetic coding. After the MCTF, every resulting picture undergoes a 2-D multi-level wavelet transform to remove the spatial redundancy within the picture. This 2-D transform is also performed through 1-D filtering and sub-sampling. The picture is first filtered along each column using the low-pass and high-pass 9/7 wavelet filters. The outputs of these filters are sub-sampled by discarding every other row, resulting in low-pass coefficients L and high-pass coefficients H. The picture consisting of the L and H coefficients is then filtered and sub-sampled along each row using the same filters and sub-sampling rule. This results in four groups of coefficients: LL1, LH1, HL1, and HH1. Each group is arranged to form a sub-picture, as shown in Figure 3. The filtering and sub-sampling process is applied repeatedly to the LL1 sub-picture to produce a multi-level 2-D wavelet transform. After the temporal and spatial wavelet transforms, the signal energy of the GOP is concentrated in the low-pass coefficients at the highest temporal and spatial levels.

Figure 1. Block diagram of the wavelet-based video codec, CRC-WVC.
Figure 2. The first MCTF level.
Figure 3. Spatial wavelet transform, (a) one level, (b) two levels.
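The column-then-row decomposition into four sub-pictures can be sketched as follows. To keep the example short, the orthonormal Haar filter pair stands in for the 9/7 filters used by the codec; the separable structure (filter and subsample along columns, then along rows) is the same.

```python
import numpy as np

def analyze_1d(a, axis):
    """One level of low/high-pass filtering plus 2:1 subsampling along
    the given axis. The Haar pair is a stand-in for the 9/7 filters."""
    a = np.moveaxis(np.asarray(a, dtype=float), axis, 0)
    low  = (a[0::2] + a[1::2]) / np.sqrt(2.0)   # low-pass, keep every other sample
    high = (a[0::2] - a[1::2]) / np.sqrt(2.0)   # high-pass, keep every other sample
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)

def wavelet_2d_one_level(picture):
    """Column filtering first, then row filtering, yielding the four
    sub-pictures LL1, LH1, HL1 and HH1 (naming as in the text)."""
    L, H = analyze_1d(picture, axis=0)    # along each column
    LL, LH = analyze_1d(L, axis=1)        # along each row of the L picture
    HL, HH = analyze_1d(H, axis=1)        # along each row of the H picture
    return LL, LH, HL, HH
```

Because the Haar pair is orthonormal, the total signal energy of the four sub-pictures equals that of the input picture; the energy-compaction effect described in the text shows up as most of that energy landing in the LL sub-picture.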
Compared with a DCT of size 4×4 or 8×8, a multi-level spatial wavelet transform is more effective at removing spatial redundancy, especially for video with high spatial resolution, which exhibits strong correlation between pixels that are more than eight pixels apart. This correlation cannot be removed by the DCT. A 5-level 2-D wavelet transform, however, can exploit the correlation between pixels that are up to 32 pixels apart. The coefficients of the wavelet transform are quantized and encoded using an extended version of the well-known set partitioning in hierarchical trees (SPIHT) algorithm (8). SPIHT treats a picture of wavelet coefficients as a set of bit planes and encodes each plane progressively, from the most significant to the least significant. Within each plane, it exploits the self-similarity of the coefficients across different wavelet bands using a zero-tree structure. This progressive encoding and decoding provides extremely fine quality scalability.
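The progressive-refinement property of bit-plane coding can be illustrated with a minimal sketch. This is not SPIHT itself: the zero-tree set partitioning and entropy coding are omitted, and only the plane-by-plane traversal is shown, so that truncating the stream of planes yields a coarser but valid reconstruction.

```python
import numpy as np

def bitplane_passes(coeffs):
    """Split coefficient magnitudes into bit planes, most significant
    plane first, as in SPIHT's plane-by-plane traversal. Zero-tree
    partitioning and entropy coding are deliberately omitted."""
    mags = np.abs(np.asarray(coeffs)).astype(np.int64)
    num_planes = int(mags.max()).bit_length()
    # Plane n holds bit n of every magnitude; emit MSB plane first.
    return [((mags >> n) & 1).astype(np.uint8)
            for n in range(num_planes - 1, -1, -1)]

def reconstruct(planes, signs, kept):
    """Rebuild the coefficients from only the first `kept` planes;
    fewer planes give a coarser quantization of the magnitudes."""
    num_planes = len(planes)
    mags = np.zeros_like(planes[0], dtype=np.int64)
    for i in range(kept):
        n = num_planes - 1 - i           # bit position of plane i
        mags |= planes[i].astype(np.int64) << n
    return signs * mags
```

Decoding all planes recovers the coefficients exactly, while stopping after the first few planes keeps only the largest magnitudes at reduced precision; this is the fine quality scalability the text describes.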