Photorealistic adaptation and interpolation of facial expressions using HMMs and AAMs for audio-visual speech synthesis

In this paper, motivated by the continuously increasing presence of intelligent agents in everyday life, we address the problem of expressive photorealistic audio-visual speech synthesis, with a strong focus on the visual modality. Emotion is one of the main driving factors of social life, and it is conveyed primarily through facial expressions. Synthesizing a talking head capable of expressive audio-visual speech is challenging because of the data overhead incurred when covering the large number of emotions we would like the talking head to express. To tackle this challenge, we propose two methods, Hidden Markov Model (HMM) adaptation and HMM interpolation, with the HMMs modeling visual parameters extracted via an Active Appearance Model (AAM) of the face. We show that HMM adaptation can successfully adapt a “neutral” talking head to a target emotion using only a small amount of adaptation data, and that HMM interpolation can robustly produce different intensity levels of an emotion.
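
Both methods operate on the Gaussian output distributions of the visual-parameter HMMs. As a rough illustration of the idea (a minimal sketch, not the paper's exact formulation), the example below assumes single-Gaussian, diagonal-covariance states over a hypothetical AAM parameter vector: adaptation applies an MLLR-style affine transform of the state means estimated from a small amount of emotional data, while interpolation blends the means and variances of corresponding neutral and emotional states with a weight alpha that controls the expression intensity.

```python
import numpy as np

def adapt_state_mean(mu, A, b):
    """MLLR-style mean adaptation: mu_adapt = A @ mu + b.
    A and b would be estimated from a small amount of emotional adaptation
    data (illustrative only; the paper may use a different adaptation
    scheme, e.g. constrained or structural MAP variants)."""
    return A @ mu + b

def interpolate_state(mu_neutral, var_neutral, mu_emotion, var_emotion, alpha):
    """Blend corresponding neutral and emotional HMM states.
    alpha = 0 reproduces the neutral model, alpha = 1 the full emotion;
    intermediate values give intermediate expression intensities."""
    mu = (1.0 - alpha) * mu_neutral + alpha * mu_emotion
    var = (1.0 - alpha) * var_neutral + alpha * var_emotion
    return mu, var

# Hypothetical 50-dimensional AAM parameter vector (shape + texture coefficients)
dim = 50
mu_n, var_n = np.zeros(dim), np.ones(dim)   # stand-in neutral state
mu_e, var_e = np.ones(dim), np.ones(dim)    # stand-in emotional state

# Half-intensity expression via interpolation
mu_half, var_half = interpolate_state(mu_n, var_n, mu_e, var_e, alpha=0.5)

# Adaptation with placeholder transform values
A, b = np.eye(dim), 0.1 * np.ones(dim)
mu_adapted = adapt_state_mean(mu_n, A, b)
```

In practice such transforms and interpolation weights would be applied per HMM state and per parameter stream, and the resulting models would drive a standard HMM parameter generation step to produce the AAM trajectories that are rendered as the final talking-head video.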
