Time-delay neural networks for estimating lip movements from speech analysis: a useful tool in audio-video synchronization

A new technique is proposed for audio-video synchronization in multimedia applications where talking human faces, either natural or synthetic, are employed for interpersonal communication services, home gaming, advanced multimodal interfaces, interactive entertainment, or movie production. Facial sequences represent an audio-visual source characterized by two strongly correlated components, a talking face and the associated speech, whose synchronous presentation must be guaranteed in any multimedia application. The exact timing for displaying a video frame or generating a synthetic facial image must therefore be supervised by some form of speech analysis, performed either as preprocessing before encoding or as postprocessing before presentation. Experimental results are reported on the use of time-delay neural networks (TDNNs) for estimating the visible articulation of the mouth directly from the analysis of the acoustic speech signal. An architecture employing a bank of independent single-output TDNNs is compared with the alternative of a single multi-output TDNN. Likewise, two learning procedures for training the TDNNs are compared: one based on the classic mean square error (MSE) and one based on a measure of cross-correlation (CC). The results demonstrate the superiority of the system based on multiple single-output TDNNs, as well as the improvements in both convergence speed and estimation fidelity achieved by the cross-correlation-based learning algorithm.
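To illustrate the two architectures being compared, the following is a minimal sketch (assuming PyTorch) of a TDNN realized as stacked 1-D convolutions over the time axis, contrasting a bank of single-output networks with a single multi-output network. The layer sizes, delay-window lengths, and the choice of 12 acoustic coefficients mapping to 4 mouth parameters are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class TDNN(nn.Module):
    """Time-delay neural network: each layer sees a sliding window of
    delayed activations, implemented here as a 1-D convolution over time."""
    def __init__(self, n_inputs, n_outputs, hidden=8, delays=(5, 3)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_inputs, hidden, kernel_size=delays[0]),   # first delay window
            nn.Tanh(),
            nn.Conv1d(hidden, n_outputs, kernel_size=delays[1]),  # second delay window
            nn.Tanh(),
        )

    def forward(self, x):   # x: (batch, n_inputs, time)
        return self.net(x)  # (batch, n_outputs, time shortened by the delays)

n_acoustic, n_mouth = 12, 4  # hypothetical: 12 acoustic coefficients -> 4 mouth parameters

# Bank of independent single-output TDNNs, one per mouth parameter
bank = [TDNN(n_acoustic, 1) for _ in range(n_mouth)]

# Alternative: a single multi-output TDNN whose hidden units are shared
multi = TDNN(n_acoustic, n_mouth)

x = torch.randn(1, n_acoustic, 100)          # 100 frames of acoustic features
estimates = torch.cat([m(x) for m in bank], dim=1)  # (1, n_mouth, 94)
```

In the bank configuration, each articulatory parameter gets its own dedicated hidden representation, whereas the multi-output network forces all parameters to share one.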
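The two training criteria can be sketched in the same way. The abstract does not specify the exact cross-correlation measure used, so the `cc_loss` below adopts the Pearson correlation between estimated and target trajectories as a plausible stand-in; `mse_loss` is the standard criterion.

```python
import torch

def mse_loss(est, tgt):
    # Classic mean square error between estimated and target trajectories
    return ((est - tgt) ** 2).mean()

def cc_loss(est, tgt, eps=1e-8):
    # 1 minus the Pearson correlation along the time axis; minimizing this
    # maximizes the normalized cross-correlation of the two trajectories.
    # Assumed stand-in for the paper's CC measure, which is not given here.
    est = est - est.mean(dim=-1, keepdim=True)
    tgt = tgt - tgt.mean(dim=-1, keepdim=True)
    cc = (est * tgt).sum(dim=-1) / (est.norm(dim=-1) * tgt.norm(dim=-1) + eps)
    return (1.0 - cc).mean()

est = torch.randn(1, 1, 94)  # estimated mouth-parameter trajectory
tgt = torch.randn(1, 1, 94)  # measured target trajectory
print(mse_loss(est, tgt).item(), cc_loss(est, tgt).item())
```

A correlation criterion is invariant to the offset and scale of a trajectory, so it rewards matching its shape rather than its absolute level, which is consistent with the convergence-speed and fidelity improvements reported above.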
