Text-to-visual speech synthesis based on parameter generation from HMM

This paper presents a new technique for synthesizing visual speech from arbitrary input text. The technique is based on an algorithm for parameter generation from HMMs with dynamic features, which has been applied successfully to text-to-speech synthesis. In the training phase, syllable HMMs are trained on visual speech parameter sequences that represent lip movements. In the synthesis phase, a sentence HMM is constructed by concatenating the syllable HMMs corresponding to the phonetic transcription of the input text, and an optimal visual speech parameter sequence is then generated from the sentence HMM in the maximum-likelihood (ML) sense. The proposed technique generates lip movements synchronized with speech in a unified framework, and coarticulation is implicitly incorporated into the generated mouth shapes. As a result, the synthesized lip motion is smooth and realistic.
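
In its simplest form, the ML parameter-generation step admits a closed-form solution: stacking the static and delta features as o = Wc, where c is the static parameter trajectory and W applies the delta windows, the ML trajectory under per-frame Gaussians with means μ and diagonal covariances Σ solves (Wᵀ Σ⁻¹ W) c = Wᵀ Σ⁻¹ μ. Below is a minimal NumPy sketch of this closed-form case for a single visual parameter with static and delta streams; the function name, the [-0.5, 0, 0.5] delta window, and the toy statistics are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def generate_trajectory(means, variances, delta_window=(-0.5, 0.0, 0.5)):
    """ML parameter generation from HMM output statistics with dynamic features.

    means, variances: (T, 2) arrays of per-frame Gaussian statistics for one
    visual speech parameter; column 0 is the static stream, column 1 the delta.
    Returns the static trajectory c (length T) maximizing the likelihood of
    o = W c, where W stacks one identity row (static) and one delta-window row
    per frame.  A sketch under the stated assumptions, not the paper's code.
    """
    T = means.shape[0]
    half = len(delta_window) // 2
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                        # static coefficient
        for k, w in enumerate(delta_window):     # delta coefficients
            tau = t + k - half
            if 0 <= tau < T:
                W[2 * t + 1, tau] = w
    mu = means.reshape(-1)                # interleaved static/delta means
    prec = 1.0 / variances.reshape(-1)    # diagonal inverse covariance
    # Normal equations of the weighted least-squares problem:
    # (W^T Sigma^-1 W) c = W^T Sigma^-1 mu
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# Toy usage: two concatenated "states" with different static means for a lip
# parameter.  The delta constraint smooths the step into a gradual transition,
# which is how coarticulation emerges implicitly in the generated mouth shapes.
means = np.array([[0.2, 0.0]] * 10 + [[0.8, 0.0]] * 10, dtype=float)
variances = np.full_like(means, 0.05)
trajectory = generate_trajectory(means, variances)
```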
