Photo-real lips synthesis with trajectory-guided sample selection

In this paper, we propose an HMM trajectory-guided, real-image-sample concatenation approach to photo-real talking head synthesis. It renders a smooth and natural video of the articulators in sync with given speech signals. An audio-visual database is first used to train a statistical Hidden Markov Model (HMM) of lip movement; the trained model is then used to generate a visual parameter trajectory of lip movement for the given speech signals, all in the maximum-likelihood sense. The HMM-generated trajectory is then used as a guide to select, from the original training database, an optimal sequence of mouth images, which are stitched back into a background head video. The whole procedure is fully automatic and data-driven. With as little as 20 minutes of audio/video footage from a speaker, the proposed system can synthesize a highly photo-real video in sync with the given speech signals. This system won first place in the audio-visual match contest of the LIPS2009 Challenge (http://www.lips2008.org/), which was perceptually evaluated by recruited human subjects.
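The pipeline described above has two algorithmic steps: maximum-likelihood trajectory generation from the trained HMM, and trajectory-guided selection of mouth images from the database. The Python/NumPy sketch below illustrates both under simplifying assumptions not stated in the abstract: diagonal covariances, a single static-plus-delta visual parameter stream, Euclidean target and concatenation costs, and per-frame candidate pruning. All function names, the pruning size `k`, and the weight `w_concat` are illustrative, not taken from the paper.

```python
# Minimal sketch of trajectory-guided sample selection, assuming the
# visual parameters are low-dimensional coefficients (e.g., PCA) of
# mouth images. Hypothetical names; not the paper's implementation.
import numpy as np

def generate_trajectory(mu, var):
    """Maximum-likelihood parameter generation (Tokuda-style).

    mu, var: (T, 2D) per-frame state means and diagonal variances for
    static+delta features. Returns the (T, D) static trajectory c that
    maximizes the HMM output likelihood, solving
        (W' V^-1 W) c = W' V^-1 mu
    independently for each static dimension.
    """
    T, two_d = mu.shape
    D = two_d // 2
    # W maps static coefficients to stacked [static; delta] observations,
    # with delta defined as (c[t+1] - c[t-1]) / 2.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row
        W[2 * t + 1, max(t - 1, 0)] -= 0.5     # delta row
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    traj = np.zeros((T, D))
    for d in range(D):
        m = mu[:, [d, D + d]].reshape(-1)      # interleave static/delta
        p = 1.0 / var[:, [d, D + d]].reshape(-1)
        A = W.T @ (p[:, None] * W)
        b = W.T @ (p * m)
        traj[:, d] = np.linalg.solve(A, b)
    return traj

def select_samples(traj, db_feats, k=10, w_concat=1.0):
    """Viterbi search for the image sequence that best follows traj.

    db_feats: (N, D) visual parameters of the N database mouth images.
    Target cost: distance of a candidate image to the trajectory frame.
    Concatenation cost: distance between consecutive chosen images.
    Returns one database index per frame.
    """
    T = len(traj)
    # Prune to the k nearest database images per frame.
    dists = np.linalg.norm(db_feats[None, :, :] - traj[:, None, :], axis=2)
    cand = np.argsort(dists, axis=1)[:, :k]            # (T, k)
    target = np.take_along_axis(dists, cand, axis=1)   # (T, k)
    cost = target[0].copy()
    back = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        # cc[i, j]: cost of moving from candidate i at t-1 to j at t.
        cc = np.linalg.norm(db_feats[cand[t - 1]][:, None, :]
                            - db_feats[cand[t]][None, :, :], axis=2)
        total = cost[:, None] + w_concat * cc
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(k)] + target[t]
    # Trace back the minimum-cost path.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [int(cand[t][j]) for t, j in enumerate(path)]
```

In use, `generate_trajectory` would consume the state-aligned Gaussian means and variances produced by the trained HMM for the input utterance, and the indices returned by `select_samples` would pick the mouth images to render. The sketch omits the final compositing step, in which the selected mouth images must still be blended seamlessly into the background head video (e.g., with gradient-domain techniques such as Poisson image editing).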
