Audio-Visual Speech Synthesis Based on Chinese Visual Triphone

A new audio-visual speech synthesis approach based on the Chinese visual triphone is proposed. The Chinese visual triphone model is constructed using a new clustering method that combines an artificial immune system with fuzzy C-means (FCM). In the analysis stage, visual triphone segments are selected from the video sequence according to the phonetic transcription of the training data, and the corresponding lip feature vectors are extracted. In the synthesis stage, a Viterbi search selects the best visual triphone segments by finding the path with the minimum total cost. Following the concatenation principles, the mouth animation is generated and stitched into the background video. Experimental results show that the synthesized video is natural-looking and satisfactory.
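The abstract does not give the search in detail; the following is a minimal sketch of Viterbi-based unit selection of the kind described, assuming per-position candidate segments plus hypothetical target_cost and concat_cost functions (the actual costs and features used by the authors are not specified here).

    # Sketch only: Viterbi search over candidate visual triphone segments.
    # candidates[t] is an assumed list of candidate segments for target position t;
    # target_cost(t, c) and concat_cost(prev_c, c) are assumed cost functions.
    import numpy as np

    def viterbi_unit_selection(candidates, target_cost, concat_cost):
        T = len(candidates)
        # best[t][j]: minimum cost of any path ending in candidates[t][j]
        best = [np.full(len(cands), np.inf) for cands in candidates]
        back = [np.zeros(len(cands), dtype=int) for cands in candidates]

        # initialize the first position with its target costs only
        for j, c in enumerate(candidates[0]):
            best[0][j] = target_cost(0, c)

        # dynamic programming over the remaining positions
        for t in range(1, T):
            for j, c in enumerate(candidates[t]):
                trans = np.array([best[t - 1][i] + concat_cost(p, c)
                                  for i, p in enumerate(candidates[t - 1])])
                back[t][j] = int(np.argmin(trans))
                best[t][j] = trans[back[t][j]] + target_cost(t, c)

        # backtrack the minimum-cost path and return the selected segments
        path = [int(np.argmin(best[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        path.reverse()
        return [candidates[t][j] for t, j in enumerate(path)]

The selected segments would then be concatenated (with smoothing at the joins) to form the mouth animation that is stitched into the background video.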
