Maximising audio-visual speech correlation

The aim of this work is to investigate a selection of audio and visual speech features in order to find pairs that maximise audio-visual correlation. Two audio speech features have been used in the analysis: filterbank vectors and the first four formant frequencies. Similarly, three visual features have been considered: active appearance model (AAM), 2-D DCT and cross-DCT features. From a database of 200 sentences, audio and visual speech features have been extracted and multiple linear regression has been used to measure the audio-visual correlation. Results reveal that filterbank features exhibit a multiple correlation of around R=0.8 with the visual features, while formant frequencies show substantially less correlation: R=0.6 for formants 1 and 2 and less than R=0.4 for formants 3 and 4. The three visual features show almost identical correlation with the audio features, varying in multiple correlation by less than 0.1, even though the methods of visual feature extraction are very different. Measuring the audio-visual correlation within each phoneme and then averaging across all phonemes increased the correlation to R=0.9.
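For illustration only, the sketch below shows one standard way a multiple correlation R of this kind can be computed: a scalar audio feature is regressed onto a vector of visual features by least squares, and R is the correlation between the audio feature and its prediction. This is a minimal sketch under that assumption, not the authors' implementation; the dimensions and synthetic data are hypothetical placeholders.

```python
# Minimal sketch (not the paper's code): multiple correlation R between
# one audio feature and a set of visual features, via multiple linear
# regression. Data and feature dimensions below are hypothetical.
import numpy as np

def multiple_correlation(audio, visual):
    """Multiple correlation R of a scalar audio feature regressed
    onto a matrix of visual features (ordinary least squares)."""
    # Append a bias column so the regression includes an intercept.
    X = np.column_stack([visual, np.ones(len(visual))])
    coeffs, *_ = np.linalg.lstsq(X, audio, rcond=None)
    predicted = X @ coeffs
    # R is the Pearson correlation between the audio feature and its
    # least-squares prediction from the visual features.
    return np.corrcoef(audio, predicted)[0, 1]

# Synthetic example: 200 frames of 10-D visual features (e.g. AAM or
# DCT coefficients) predicting a single filterbank channel.
rng = np.random.default_rng(0)
visual = rng.normal(size=(200, 10))
audio = visual @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
print(f"R = {multiple_correlation(audio, visual):.2f}")
```

Repeating this fit for each audio feature (each filterbank channel or formant) against each visual feature set yields the per-pair R values reported above; the per-phoneme variant simply restricts the regression to the frames of one phoneme before averaging R across phonemes.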
