A statistical parametric approach to video-realistic text-driven talking avatar

This paper proposes a statistical parametric approach to building a video-realistic, text-driven talking avatar. We follow the trajectory-HMM framework, in which audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized under the maximum-likelihood criterion. Previous trajectory-HMM approaches focused only on mouth animation, synthesizing either simple geometric mouth shapes or video-realistic lip motion. Our approach instead uses trajectory HMMs to generate visual parameters of the lower face, yielding video-realistic animation of the whole face. Specifically, we model the visual speech with an active appearance model (AAM), which offers a convenient and compact statistical description of both the shape and the appearance variations of the face. To achieve video-realistic effects with high fidelity, we apply the Poisson image editing technique to stitch the synthesized lower-face image seamlessly onto a whole-face image. Objective and subjective experiments show that the proposed approach produces natural facial animation.
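The maximum-likelihood parameter generation at the core of the trajectory-HMM framework can be sketched in a few lines. This is a minimal one-dimensional illustration, not the paper's implementation: the frame count, state-aligned means, and unit precisions below are hypothetical, and in practice the same solve is applied jointly to multi-dimensional AAM shape and appearance parameters.

```python
import numpy as np

# MLPG sketch for one 1-D visual feature over T frames. The HMM supplies
# per-frame means and variances for the static feature and its delta; the
# smooth trajectory c solves (W' P W) c = W' P mu, where W stacks the
# static and delta windows and P is the diagonal precision matrix.
T = 5
W = np.zeros((2 * T, T))
for t in range(T):
    W[t, t] = 1.0                  # static window: o_t = c_t
    if t > 0:
        W[T + t, t - 1] = -0.5     # delta window: d_t = (c_{t+1} - c_{t-1}) / 2
    if t < T - 1:
        W[T + t, t + 1] = 0.5

# Hypothetical state-aligned means: statics rise from 0 to 2, delta means
# are zero, so the ML solution is a smoothed version of the static means.
mu = np.concatenate([np.array([0.0, 1.0, 2.0, 2.0, 2.0]), np.zeros(T)])
prec = np.ones(2 * T)              # unit precisions for simplicity

A = W.T @ (prec[:, None] * W)
b = W.T @ (prec * mu)
c = np.linalg.solve(A, b)          # maximum-likelihood trajectory
```

Because the delta constraints penalize abrupt frame-to-frame changes, `c` follows the static means while remaining smooth, which is what makes the generated lower-face motion continuous rather than piecewise constant per HMM state.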
