Expressive Visual Speech Generation

With the emergence of 3D graphics, we are now able to create very realistic 3D characters that can move and talk. Multimodal interaction with such characters is also possible, as the underlying technologies for speech and video analysis, natural language dialogue, and animation have matured. However, the behavior expressed by these characters is far from believable in most systems. We believe this problem arises from their lack of individuality at several levels: perception, dialogue, and expression. In this chapter, we describe results of research that aims to realistically connect personality with 3D characters, not only at the expressive level (for example, generating individualized expressions on a 3D face), but also at the dialogue level (generating responses that correspond to what a given personality in a given emotional state would say) and at the perceptive level (using real-time video tracking of the user's expressions so that the virtual character can produce corresponding behavior). The idea of linking personality with agent behavior has been discussed by Marsella et al. [33], who consider the influence of emotion on behavior in general, and by Johns et al. [21], who examine how personality and emotion can affect decision making.

Traditionally, text- or voice-driven speech animation systems use phonemes as the basic units of speech and visemes as the basic units of animation. Although text-to-speech synthesizers and phoneme recognizers often use biphone-based techniques internally, the end user seldom has access to this information, except in dedicated systems. Most commercially and freely available software applications expose only time-stamped phoneme streams along with the audio. To generate animation from this information, an extra level of processing, namely co-articulation, is therefore required; this process accounts for the influence of neighboring visemes during fluent speech production (a minimal sketch of this conventional phoneme-to-viseme stage is given at the end of this section). The co-articulation stage can be eliminated by using the syllable, rather than the phoneme, as the basic unit of speech.

Overall, we do not intend to give a complete survey of ongoing research in behavior, emotion, and personality. Our main goal is to create believable conversational agents that can interact through multiple modalities. We therefore concentrate on emotion extraction from a real user (Section 2.3), visyllable-based speech animation (Section 2.4), and dialogue systems and emotions (Section 2.5).
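To make this concrete, the listing below is a minimal sketch, in Python, of how a time-stamped phoneme stream (the output most off-the-shelf recognizers and synthesizers expose) could be turned into viseme keyframes with a crude co-articulation blend. The phoneme-to-viseme table, the spill-over weight, and the function names are illustrative assumptions made for this sketch only; they are not taken from the chapter, whose visyllable-based approach replaces this extra co-articulation stage altogether.

    # Illustrative only: convert a time-stamped phoneme stream into viseme
    # keyframes with a crude co-articulation blend (each viseme spills a
    # fraction of its weight onto its immediate neighbours). The mapping
    # table and the spill weight are assumptions, not values from the chapter.
    from dataclasses import dataclass

    # Partial phoneme -> viseme grouping; standards such as MPEG-4 define a
    # small fixed viseme set, but only a few classes are listed here.
    PHONEME_TO_VISEME = {
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "aa": "open_vowel", "ae": "open_vowel",
        "iy": "spread_vowel", "eh": "spread_vowel",
        "sil": "neutral",
    }

    @dataclass
    class PhonemeEvent:
        phoneme: str
        start: float  # seconds
        end: float    # seconds

    def to_viseme_keyframes(events, spill=0.25):
        """Place one viseme keyframe at each phoneme's midpoint and let a
        fraction of the neighbouring visemes bleed into it, as a stand-in
        for a proper co-articulation model."""
        keyframes = []
        for i, ev in enumerate(events):
            viseme = PHONEME_TO_VISEME.get(ev.phoneme, "neutral")
            weights = {viseme: 1.0}
            for j in (i - 1, i + 1):  # immediate neighbours only
                if 0 <= j < len(events):
                    neighbour = PHONEME_TO_VISEME.get(events[j].phoneme, "neutral")
                    weights[neighbour] = weights.get(neighbour, 0.0) + spill
            total = sum(weights.values())  # normalize so the blend sums to 1
            keyframes.append({
                "time": 0.5 * (ev.start + ev.end),
                "weights": {v: w / total for v, w in weights.items()},
            })
        return keyframes

    if __name__ == "__main__":
        stream = [
            PhonemeEvent("sil", 0.00, 0.10),
            PhonemeEvent("m", 0.10, 0.18),
            PhonemeEvent("aa", 0.18, 0.35),
            PhonemeEvent("p", 0.35, 0.45),
        ]
        for kf in to_viseme_keyframes(stream):
            print(f"{kf['time']:.2f}s  {kf['weights']}")

Even in this toy form, the per-keyframe weight vector makes clear why a dedicated co-articulation stage is needed when phonemes are the unit of speech, and why choosing a larger unit such as the syllable lets that stage be folded into the animation data itself.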

[1] Judea Pearl, et al. Probabilistic reasoning in intelligent systems - networks of plausible inference, 1991, Morgan Kaufmann Series in Representation and Reasoning.

[2] George Anton Kiraz, et al. Multilingual syllabification using weighted finite-state transducers, 1998, SSW.

[3] Takeo Kanade, et al. Recognizing Action Units for Facial Expression Analysis, 2001, IEEE Trans. Pattern Anal. Mach. Intell.

[4] Nadia Magnenat-Thalmann, et al. Visyllable Based Speech Animation, 2003, Comput. Graph. Forum.

[5] Christoph Bregler, et al. Video Rewrite: Driving Visual Speech with Audio, 1997, SIGGRAPH.

[6] Nadia Magnenat-Thalmann, et al. MPEG-4 based animation with face feature tracking, 1999, Computer Animation and Simulation.

[7] Tom E. Bishop, et al. Blind Image Restoration Using a Block-Stationary Signal Model, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[8] Sumedha Kshirsagar, et al. A multilayer personality model, 2002, SMARTGRAPH '02.

[9] J. M. Digman. Personality Structure: Emergence of the Five-Factor Model, 1990.

[10] Pierre Poulin, et al. Real-time facial animation based upon a bank of 3D facial expressions, 1998, Proceedings Computer Animation '98 (Cat. No.98EX169).

[11] Hans Peter Graf, et al. Sample-based synthesis of photo-realistic talking heads, 1998, Proceedings Computer Animation '98 (Cat. No.98EX169).

[12] Rosalind W. Picard. Affective Computing, 1997.

[13] Hans Peter Graf, et al. Triphone based unit selection for concatenative visual speech synthesis, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14] J. E. Ball, et al. Emotion and Personality in a Conversational Character, 1998.

[15] Michael M. Cohen, et al. Modeling Coarticulation in Synthetic Visual Speech, 1993.

[16] Dimitris N. Metaxas, et al. Optical Flow Constraints on Deformable Models with Applications to Face Tracking, 2000, International Journal of Computer Vision.

[17] A. C. Gimson, et al. An introduction to the pronunciation of English, 1991.

[18] P. Ekman. Emotion in the human face, 1982.

[19] Rhys James Jones, et al. Continuous speech recognition using syllables, 1997, EUROSPEECH.

[20] Andrew Ortony, et al. The Cognitive Structure of Emotions, 1988.

[21] B. Silverman, et al. How Emotions and Personality Effect the Utility of Alternative Decisions: A Terrorist Target Selection Case Study, 2001.

[22] Nadia Magnenat-Thalmann, et al. Principal components of expressive speech animation, 2001, Proceedings. Computer Graphics International 2001.

[23] Juan David Velásquez, et al. Modeling Emotions and Other Motivations in Synthetic Agents, 1997, AAAI/IAAI.

[24] P. Ekman, et al. Facial action coding system: a technique for the measurement of facial movement, 1978.

[25] Ronan Boulic, et al. Standards for Virtual Humans, 2006.

[26] Luc Van Gool, et al. Face animation based on observed 3D speech dynamics, 2001, Proceedings Computer Animation 2001. Fourteenth Conference on Computer Animation (Cat. No.01TH8596).

[27] R. McCrae, et al. An introduction to the five-factor model and its applications, 1992, Journal of Personality.

[28] Mark Steedman, et al. Generating Facial Expressions for Speech, 1996, Cogn. Sci.

[29] Xuan Zhang, et al. Emotional Communication with Virtual Humans, 2003, MMM.

[30] Joseph Picone, et al. Advances in alphadigit recognition using syllables, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[31] Takeo Kanade, et al. Automated facial expression recognition based on FACS action units, 1998, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition.

[32] N. Magnenat-Thalmann, et al. Automatic face cloning and animation using real-time facial feature tracking and speech acquisition, 2001, IEEE Signal Process. Mag.

[33] Stacy Marsella, et al. A step toward irrationality: using emotion to change belief, 2002, AAMAS '02.

[34] Daniel Thalmann, et al. Towards Natural Communication in Networked Collaborative Virtual Environments, 1996.

[35] Thomas Rist, et al. Integrating Models of Personality and Emotions into Lifelike Characters, 1999, IWAI.

[36] John Yen, et al. PETEEI: a PET with evolving emotional intelligence, 1999, AGENTS '99.

[37] Nadia Magnenat-Thalmann, et al. Feature Point Based Mesh Deformation Applied to MPEG-4 Facial Animation, 2000, DEFORM/AVATARS.