Expressive Visual Text-to-Speech Using Active Appearance Models

This paper presents a complete system for expressive visual text-to-speech (VTTS) that, given input text and a set of continuous expression weights, produces expressive output in the form of a 'talking head'. The face is modeled using an active appearance model (AAM), and several extensions are proposed that make it better suited to the task of VTTS. The model allows for normalization with respect to both pose and blink state, which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in reconstruction error over a million frames, as well as in large-scale user studies comparing the output of different systems.
