Speech-to-Lip Movement Synthesis by Maximizing Audio-Visual Joint Probability Based on the EM Algorithm

In this paper, we investigate a Hidden Markov Model (HMM)-based method for driving a lip movement sequence from input speech. In a previous study, we investigated a mapping method based on the Viterbi decoding algorithm, which converts an input speech signal into a lip movement sequence via the most likely state sequence of audio HMMs. However, that method can produce errors caused by incorrectly decoded HMM states. This paper proposes a method that re-estimates the visual parameters by maximizing the audio-visual joint probability of audio-visual HMMs with the Expectation-Maximization (EM) algorithm. In the experiments, the proposed mapping method achieves a 26% error reduction over the Viterbi-based algorithm on incorrectly decoded bilabial consonants.
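One concrete way to realize such an EM-based mapping is to alternate between an E-step that computes per-frame HMM state posteriors with the forward-backward algorithm and an M-step that re-estimates each frame's visual parameters as the posterior-weighted conditional mean of the states' joint audio-visual Gaussians. The sketch below illustrates this under simplifying assumptions (single-Gaussian output densities per state, known model parameters, toy data); all function names, the initialization scheme, and the data are illustrative, not the paper's actual implementation.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)


def log_gauss(x, mu, cov):
    """Row-wise log N(x[t] | mu, cov) for x of shape (T, D)."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ti,ij,tj->t", d, np.linalg.inv(cov), d)
    return -0.5 * (quad + logdet + mu.size * np.log(2.0 * np.pi))


def state_posteriors(log_b, log_A, log_pi):
    """Forward-backward: per-frame state posteriors gamma of shape (T, N)."""
    T, N = log_b.shape
    log_alpha = np.zeros((T, N))
    log_beta = np.zeros((T, N))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    return np.exp(log_gamma - logsumexp(log_gamma, axis=1)[:, None])


def em_map_audio_to_visual(audio, log_A, log_pi, mu_a, mu_v, S_aa, S_vv, S_va, n_iter=5):
    """Estimate a visual trajectory from audio by iteratively maximizing the
    audio-visual joint probability (a sketch, not the paper's exact update)."""
    N = mu_a.shape[0]
    # Per-state regression W[j] = S_va[j] inv(S_aa[j]) and conditional covariance.
    W = np.stack([S_va[j] @ np.linalg.inv(S_aa[j]) for j in range(N)])
    S_cond = np.stack([S_vv[j] - W[j] @ S_va[j].T for j in range(N)])
    # Initialize posteriors from the audio stream alone.
    log_b = np.stack([log_gauss(audio, mu_a[j], S_aa[j]) for j in range(N)], axis=1)
    gamma = state_posteriors(log_b, log_A, log_pi)
    for _ in range(n_iter):
        # M-step: visual frame = posterior-weighted conditional mean E[v | a, j].
        visual = np.zeros((audio.shape[0], mu_v.shape[1]))
        for j in range(N):
            visual += gamma[:, [j]] * (mu_v[j] + (audio - mu_a[j]) @ W[j].T)
        # E-step: rescore with the joint density p(a, v | j) = p(a | j) p(v | a, j).
        log_b = np.stack(
            [log_gauss(audio, mu_a[j], S_aa[j])
             + log_gauss(visual - (audio - mu_a[j]) @ W[j].T, mu_v[j], S_cond[j])
             for j in range(N)], axis=1)
        gamma = state_posteriors(log_b, log_A, log_pi)
    return visual


if __name__ == "__main__":
    # Toy model: N states, 2-D audio and 2-D visual (e.g., lip width/height) features.
    rng = np.random.default_rng(0)
    N, Da, Dv, T = 3, 2, 2, 40
    mu_a = 3.0 * rng.normal(size=(N, Da))
    mu_v = rng.normal(size=(N, Dv))
    J = rng.normal(size=(N, Da + Dv, Da + Dv))
    J = J @ J.transpose(0, 2, 1) + 0.5 * np.eye(Da + Dv)    # joint covariances (PD)
    S_aa, S_va, S_vv = J[:, :Da, :Da], J[:, Da:, :Da], J[:, Da:, Da:]
    log_A = np.log(np.full((N, N), 0.1) + 0.7 * np.eye(N))  # rows sum to 1
    log_pi = np.log(np.full(N, 1.0 / N))
    audio = mu_a[rng.integers(N, size=T)] + rng.normal(size=(T, Da))
    visual = em_map_audio_to_visual(audio, log_A, log_pi, mu_a, mu_v, S_aa, S_vv, S_va)
    print(visual.shape)  # (40, 2): estimated lip-parameter trajectory
```

Initializing the posteriors from the audio stream alone plays roughly the role of the audio-only decoding in the Viterbi-based baseline; the subsequent iterations rescore each frame with the joint audio-visual density, which is what can pull incorrectly decoded frames (e.g., at bilabial closures) toward more plausible states.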
