Comparison of windowing in speech and audio coding

Speech and audio coding have converged over the last decade toward an increasingly unified technology. This contribution discusses one of the remaining fundamental differences between the speech and audio paradigms, namely, windowing of the input signal. Audio codecs generally use lapped transforms and apply a perceptual model in the transform domain, achieving temporal continuity by windowing and overlap-add. Speech codecs, on the other hand, achieve temporal continuity with linear predictive filtering, and windowing is applied in the residual domain. Despite these fundamental differences, we demonstrate that the two windowing approaches, combined with perceptual modeling, perform very similarly both in terms of perceptual quality and theoretical properties.
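
As a minimal sketch of the audio-codec approach described above, the following Python snippet shows windowing with 50% overlap-add using a power-complementary sine window (the condition used in MDCT-based lapped transforms), so that analysis and synthesis windowing together reconstruct the signal exactly in the interior. The frame length, hop size, and test signal are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

N = 512            # frame length (illustrative choice)
hop = N // 2       # 50% overlap
n = np.arange(N)
# Sine window satisfies w[n]^2 + w[n + hop]^2 == 1 (power complementary).
window = np.sin(np.pi * (n + 0.5) / N)

x = np.random.randn(hop * 20)  # arbitrary test signal

# Analysis: window each frame (a lapped transform such as the MDCT
# would operate on these windowed frames).
frames = [window * x[k:k + N] for k in range(0, len(x) - N + 1, hop)]

# Synthesis: window again and overlap-add.
y = np.zeros(len(x))
for i, f in enumerate(frames):
    y[i * hop:i * hop + N] += window * f

# Away from the first and last half-frame, reconstruction is exact
# up to floating-point precision.
print(np.max(np.abs(y[hop:-hop] - x[hop:-hop])))  # ~1e-16
```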
