SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support

This paper describes SynFace, a supportive technology that aims to enhance audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with a focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for the Swedish wide-band speech quality found in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven by the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).
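
To make the architecture concrete, the sketch below outlines a frame-based audio-to-animation pipeline of the kind the abstract describes. It is illustrative only: the function `recognise_frame`, the phoneme-to-viseme table, and the smoothing constants are all hypothetical placeholders, not the actual SynFace implementation, in which a low-latency recurrent-network phonetic recogniser drives trained articulatory control models.

```python
# Illustrative SynFace-style pipeline (hypothetical names and values;
# not the actual SynFace code). Audio frames -> phoneme label ->
# viseme targets -> smoothed articulatory parameters for a face model.
import numpy as np

# Hypothetical phoneme-to-viseme mapping: each recognised phoneme is
# reduced to a mouth shape described by a few articulatory parameters.
PHONEME_TO_VISEME = {
    "p":   {"jaw_open": 0.0, "lip_round": 0.0, "lip_closure": 1.0},
    "a":   {"jaw_open": 0.8, "lip_round": 0.1, "lip_closure": 0.0},
    "u":   {"jaw_open": 0.3, "lip_round": 0.9, "lip_closure": 0.0},
    "sil": {"jaw_open": 0.1, "lip_round": 0.0, "lip_closure": 0.2},
}

def recognise_frame(frame: np.ndarray) -> str:
    """Stand-in for the phonetic recogniser (per ~10 ms frame).
    Here a dummy energy threshold separates silence from a vowel."""
    energy = float(np.mean(frame ** 2))
    return "sil" if energy < 1e-4 else "a"

def frames(signal: np.ndarray, frame_len: int = 160):
    """Split the signal into consecutive frames (10 ms at 16 kHz)."""
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        yield signal[start:start + frame_len]

def animate(signal: np.ndarray):
    """Yield one set of articulatory parameters per frame, smoothed
    so the face moves continuously instead of jumping between shapes."""
    state = {k: 0.0 for k in PHONEME_TO_VISEME["sil"]}
    for frame in frames(signal):
        target = PHONEME_TO_VISEME[recognise_frame(frame)]
        # First-order smoothing toward the current viseme target.
        state = {k: 0.7 * state[k] + 0.3 * target[k] for k in state}
        yield state

if __name__ == "__main__":
    # One second of synthetic audio: silence, then a 220 Hz "vowel".
    sr = 16000
    audio = np.concatenate([
        np.zeros(sr // 2),
        0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr),
    ])
    for i, params in enumerate(animate(audio)):
        if i % 25 == 0:  # print every 250 ms
            print(i, {k: round(v, 2) for k, v in params.items()})
```

The first-order smoothing here merely stands in for coarticulation modelling; in a real system the articulatory trajectories would come from trained control models rather than a fixed lookup table.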
