论文信息 - Photo-real lips synthesis with trajectory-guided sample selection

Photo-real lips synthesis with trajectory-guided sample selection

In this paper, we propose an HMM trajectory-guided, real image sample concatenation approach to photo-real talking head synthesis. It renders a smooth and natural video of articulators in sync with given speech signals. An audio-visual database is used to train a statistical Hidden Markov Model (HMM) of lips movement first and the trained model is then used to generate a visual parameter trajectory of lips movement for given speech signals, all in the maximum likelihood sense. The HMM generated trajectory is then used as a guide to select, in the original training database, an optimal sequence of mouth images which are then stitched back to a background head video. The whole procedure is fully automatic and data driven. With an audio/video footage as short as 20 minutes from a speaker, the proposed system can synthesize a highly photo-real video in sync with the given speech signals. This system won the FIRST place in the Audio-Visual match contest in LIPS2009 Challenge, which was perceptually evaluated by recruited human subjects. http://www.lips2008.org/

[1] Alan W. Black,et al. Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[2] Keiichi Tokuda,et al. Speech synthesis using HMMs with dynamic features , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[3] Christoph Bregler,et al. Video Rewrite: Driving Visual Speech with Audio , 1997, SIGGRAPH.

[4] Alex Acero,et al. Recent improvements on Microsoft's trainable text-to-speech system-Whistler , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5] Hans Peter Graf,et al. Sample-based synthesis of photo-realistic talking heads , 1998, Proceedings Computer Animation '98 (Cat. No.98EX169).

[6] Robert E. Donovan,et al. The IBM trainable speech synthesis system , 1998, ICSLP.

[7] Tony Ezzat,et al. MikeTalk: a talking facial display based on morphing visemes , 1998, Proceedings Computer Animation '98 (Cat. No.98EX169).

[8] Keiichi Tokuda,et al. Hidden Markov models based on multi-space probability distribution for pitch pattern modeling , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[9] Keiichi Tokuda,et al. HMM-based text-to-audio-visual speech synthesis , 2000, INTERSPEECH.

[10] Hans Peter Graf,et al. Photo-Realistic Talking-Heads from Image Samples , 2000, IEEE Trans. Multim..

[11] Tsuhan Chen,et al. Audiovisual speech processing , 2001, IEEE Signal Process. Mag..

[12] Satoshi Nakamura,et al. Statistical multimodal integration for audio-visual speech processing , 2002, IEEE Trans. Neural Networks.

[13] Hans Peter Graf,et al. Triphone based unit selection for concatenative visual speech synthesis , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14] Patrick Pérez,et al. Poisson image editing , 2003, ACM Trans. Graph..

[15] Toshio Hirai,et al. Using 5 ms segments in concatenative speech synthesis , 2004, SSW.

[16] Gavin C. Cawley,et al. Near-videorealistic synthetic talking faces: implementation and evaluation , 2004, Speech Commun..

[17] Tomaso Poggio,et al. Trainable Videorealistic Speech Animation , 2004, FGR.

[18] Tsuhan Chen,et al. Integration strategies for audio-visual speech processing: applied to text-dependent speaker recognition , 2005, IEEE Transactions on Multimedia.

[19] Keiichi Tokuda,et al. Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[20] Jörn Ostermann,et al. Parameterization of Mouth Images by LLE and PCA for Image-Based Facial Animation , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[21] Zhenhua Ling. HMM-based Unit Selection Using F , 2006 .

[22] Lei Xie,et al. Speech Animation Using Coupled Hidden Markov Models , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[23] Harry Shum,et al. Real-Time Bayesian 3-D Pose Tracking , 2006, IEEE Transactions on Circuits and Systems for Video Technology.

[24] Heiga Zen,et al. The HMM-based speech synthesis system (HTS) version 2.0 , 2007, SSW.

[25] Jörn Ostermann,et al. Realistic facial animation system for interactive services , 2008, INTERSPEECH.

[26] Gérard Bailly,et al. LIPS2008: visual speech synthesis challenge , 2008, INTERSPEECH.

[27] Hichem Sahli,et al. Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis , 2008, MLMI.

[28] Zhi-Jie Yan,et al. RIch-context Unit Selection (RUS) approach to high quality TTS , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29] J. P. Lewis. Fast Normalized Cross-Correlation , 2010 .