Lip animation synthesis: a unified framework for speaking and laughing virtual agent

This paper proposes a unified statistical framework for synthesizing speaking and laughing lip animations for virtual agents in real time. The lip animation synthesis model takes as input the decomposition of spoken text into phonemes together with their durations, and can therefore also be driven by synthesized speech. In the training step, Gaussian mixture models (GMMs), called lip shape GMMs, are learnt from human motion capture data to model the relationship between phoneme duration and lip shape; an interpolation function based on hidden Markov models (HMMs), called HMM interpolation, is learnt from the same data. In the synthesis step, the lip shape GMMs infer an initial lip shape stream from the inputs; this stream is then smoothed by the learnt HMM interpolation to obtain the final lip animation. The effectiveness of the proposed framework is confirmed by an objective evaluation.
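To make the two-step pipeline concrete, the sketch below shows one plausible reading of it in Python: a joint GMM over phoneme features and lip-shape parameters is fitted, lip shapes are inferred by conditional expectation, and the resulting stream is smoothed. It is a minimal illustration, not the authors' implementation: the feature layout is hypothetical, scikit-learn's GaussianMixture stands in for the lip shape GMMs, and a simple moving-average filter replaces the paper's HMM-learnt interpolation.

```python
# Minimal sketch of the described pipeline (assumed feature layout; the
# HMM-learnt interpolation of the paper is replaced by a moving average).
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_joint_gmm(phoneme_feats, lip_shapes, n_components=4, seed=0):
    """Fit a joint GMM over concatenated [phoneme features | lip-shape params]."""
    joint = np.hstack([phoneme_feats, lip_shapes])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(joint)
    return gmm


def infer_lip_shapes(gmm, phoneme_feats, d_in):
    """Conditional expectation E[lip shape | phoneme features] under the joint GMM."""
    out = []
    for x in phoneme_feats:
        log_resp, cond_means = [], []
        for k in range(gmm.n_components):
            mu, cov = gmm.means_[k], gmm.covariances_[k]
            mu_x, mu_y = mu[:d_in], mu[d_in:]
            cov_xx, cov_yx = cov[:d_in, :d_in], cov[d_in:, :d_in]
            diff = x - mu_x
            # Log of component weight times N(x; mu_x, cov_xx).
            _, logdet = np.linalg.slogdet(cov_xx)
            log_px = -0.5 * (diff @ np.linalg.solve(cov_xx, diff)
                             + logdet + d_in * np.log(2 * np.pi))
            log_resp.append(np.log(gmm.weights_[k]) + log_px)
            # Conditional mean of the lip-shape part for this component.
            cond_means.append(mu_y + cov_yx @ np.linalg.solve(cov_xx, diff))
        log_resp = np.array(log_resp)
        resp = np.exp(log_resp - log_resp.max())
        resp /= resp.sum()
        out.append(np.sum(resp[:, None] * np.array(cond_means), axis=0))
    return np.array(out)


def smooth(stream, window=3):
    """Stand-in for the HMM interpolation: per-dimension moving average."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, stream)


# Toy usage: 2-D phoneme features (e.g. phoneme code, duration) -> 3-D lip shape.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))
gmm = fit_joint_gmm(X, Y)
raw_stream = infer_lip_shapes(gmm, rng.normal(size=(10, 2)), d_in=2)
animation = smooth(raw_stream)
print(animation.shape)  # (10, 3): one lip-shape vector per input frame
```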
