Speech-to-Lip Movement Synthesis by Maximizing Audio-Visual Joint Probability Based on the EM Algorithm

In this paper, we investigate a Hidden Markov Model (HMM)-based method for driving a lip movement sequence from input speech. In a previous study, we investigated a mapping method based on the Viterbi decoding algorithm, which converts an input speech signal into a lip movement sequence via the most likely state sequence of audio HMMs. However, that method can produce errors caused by incorrectly decoded HMM states. This paper proposes a method that re-estimates the visual parameters by maximizing the audio-visual joint probability of audio-visual HMMs with the Expectation-Maximization (EM) algorithm. In the experiments, the proposed mapping method achieves a 26% error reduction over the Viterbi-based algorithm on incorrectly decoded bilabial consonants.
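One concrete way to realize such an EM-based mapping is to alternate between an E-step that computes per-frame HMM state posteriors with the forward-backward algorithm and an M-step that re-estimates each frame's visual parameters as the posterior-weighted conditional mean of the states' joint audio-visual Gaussians. The sketch below illustrates this under simplifying assumptions (single-Gaussian output densities per state, known model parameters, toy data); all function names, the initialization scheme, and the data are illustrative, not the paper's actual implementation.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)


def log_gauss(x, mu, cov):
    """Row-wise log N(x[t] | mu, cov) for x of shape (T, D)."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ti,ij,tj->t", d, np.linalg.inv(cov), d)
    return -0.5 * (quad + logdet + mu.size * np.log(2.0 * np.pi))


def state_posteriors(log_b, log_A, log_pi):
    """Forward-backward: per-frame state posteriors gamma of shape (T, N)."""
    T, N = log_b.shape
    log_alpha = np.zeros((T, N))
    log_beta = np.zeros((T, N))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    return np.exp(log_gamma - logsumexp(log_gamma, axis=1)[:, None])


def em_map_audio_to_visual(audio, log_A, log_pi, mu_a, mu_v, S_aa, S_vv, S_va, n_iter=5):
    """Estimate a visual trajectory from audio by iteratively maximizing the
    audio-visual joint probability (a sketch, not the paper's exact update)."""
    N = mu_a.shape[0]
    # Per-state regression W[j] = S_va[j] inv(S_aa[j]) and conditional covariance.
    W = np.stack([S_va[j] @ np.linalg.inv(S_aa[j]) for j in range(N)])
    S_cond = np.stack([S_vv[j] - W[j] @ S_va[j].T for j in range(N)])
    # Initialize posteriors from the audio stream alone.
    log_b = np.stack([log_gauss(audio, mu_a[j], S_aa[j]) for j in range(N)], axis=1)
    gamma = state_posteriors(log_b, log_A, log_pi)
    for _ in range(n_iter):
        # M-step: visual frame = posterior-weighted conditional mean E[v | a, j].
        visual = np.zeros((audio.shape[0], mu_v.shape[1]))
        for j in range(N):
            visual += gamma[:, [j]] * (mu_v[j] + (audio - mu_a[j]) @ W[j].T)
        # E-step: rescore with the joint density p(a, v | j) = p(a | j) p(v | a, j).
        log_b = np.stack(
            [log_gauss(audio, mu_a[j], S_aa[j])
             + log_gauss(visual - (audio - mu_a[j]) @ W[j].T, mu_v[j], S_cond[j])
             for j in range(N)], axis=1)
        gamma = state_posteriors(log_b, log_A, log_pi)
    return visual


if __name__ == "__main__":
    # Toy model: N states, 2-D audio and 2-D visual (e.g., lip width/height) features.
    rng = np.random.default_rng(0)
    N, Da, Dv, T = 3, 2, 2, 40
    mu_a = 3.0 * rng.normal(size=(N, Da))
    mu_v = rng.normal(size=(N, Dv))
    J = rng.normal(size=(N, Da + Dv, Da + Dv))
    J = J @ J.transpose(0, 2, 1) + 0.5 * np.eye(Da + Dv)    # joint covariances (PD)
    S_aa, S_va, S_vv = J[:, :Da, :Da], J[:, Da:, :Da], J[:, Da:, Da:]
    log_A = np.log(np.full((N, N), 0.1) + 0.7 * np.eye(N))  # rows sum to 1
    log_pi = np.log(np.full(N, 1.0 / N))
    audio = mu_a[rng.integers(N, size=T)] + rng.normal(size=(T, Da))
    visual = em_map_audio_to_visual(audio, log_A, log_pi, mu_a, mu_v, S_aa, S_vv, S_va)
    print(visual.shape)  # (40, 2): estimated lip-parameter trajectory
```

Initializing the posteriors from the audio stream alone plays roughly the role of the audio-only decoding in the Viterbi-based baseline; the subsequent iterations rescore each frame with the joint audio-visual density, which is what can pull incorrectly decoded frames (e.g., at bilabial closures) toward more plausible states.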
