An audio-visual distance for audio-visual speech vector quantization

Speech is both an acoustic and a visual signal, and the two modalities exhibit both complementarity and redundancy. In the speech coding domain, it is of great interest to exploit this redundancy to improve speech coder performance. In this paper, we consider a joint audio-video coding process based on audio-visual vector quantization. The method is shown to exploit the audio-visual redundancy well, as it reduces the bit rate while also decreasing the quantization error. To this end, a notion of audio-visual distance must be introduced and adapted to the differing nature of the two data streams. It is defined from an existing audio distance and a new visual distance, on which we focus in particular.
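To make the idea of an audio-visual distance for vector quantization concrete, the following is a minimal sketch, not the paper's actual formulation: it assumes each observation concatenates audio parameters with lip-shape parameters, uses squared-Euclidean distances for both modalities, and balances them with an arbitrary weight alpha. The paper's audio distance (derived from existing speech distance measures) and its new visual distance are not reproduced here.

```python
import numpy as np

def audiovisual_distance(x, c, n_audio, alpha=0.5):
    """Weighted combination of an audio and a visual distance (illustrative only).

    x, c    : concatenated audio-visual feature vectors (audio part first).
    n_audio : number of audio coefficients at the start of each vector.
    alpha   : hypothetical weight balancing the two modalities; the actual
              adaptation to the differing nature of the data is paper-specific.
    """
    d_audio = np.sum((x[:n_audio] - c[:n_audio]) ** 2)   # distance on audio parameters
    d_video = np.sum((x[n_audio:] - c[n_audio:]) ** 2)   # distance on lip-shape parameters
    return alpha * d_audio + (1.0 - alpha) * d_video

def quantize(x, codebook, n_audio, alpha=0.5):
    """Return the index of the codeword minimizing the audio-visual distance."""
    dists = [audiovisual_distance(x, c, n_audio, alpha) for c in codebook]
    return int(np.argmin(dists))

# Example with 10 audio coefficients and 3 lip parameters per vector
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 13))   # hypothetical trained codebook
x = rng.normal(size=13)                 # one audio-visual observation
print(quantize(x, codebook, n_audio=10))
```

A single codebook indexed with such a joint distance is what allows one transmitted index to describe both modalities, which is where the bit-rate saving over separate audio and video quantizers comes from.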
