On evaluating synthesised visual speech

Abstract: This paper describes issues relating to the subjective evaluation of synthesised visual speech. Two approaches to synthesis are compared: a text-driven synthesiser and a speech-driven synthesiser. Both synthesisers are trained on the same data, and both use the same model for rendering the synthesised visual speech. Naturalness is used as the performance metric, and the naturalness of real visual speech re-rendered on the same model is used as a benchmark. The naturalness of the text-driven synthesiser is significantly better than that of the speech-driven synthesiser, but neither synthesiser yet achieves the naturalness of real visual speech. The impact of likely sources of error apparent in the synthesised visual speech is also investigated: similar forms of error are introduced into real visual speech sequences, and the resulting degradation in naturalness is measured using the same naturalness ratings used to evaluate the performance of the synthesisers. We find that the overall perception of sentence-level utterances is severely degraded when only a small region of an otherwise perfect rendering of the visual sequence is incorrect. For example, if the visual gesture for only a single syllable in an utterance is incorrect, the overall naturalness of this real sequence is rated lower than that of the text-driven synthesiser.

Index Terms: evaluation, visual speech synthesis, active appearance models
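The evaluation described above rests on aggregating subjective naturalness ratings per condition and testing whether the differences between conditions are significant. As a minimal sketch of how such ratings might be summarised and compared, the Python snippet below computes a mean opinion score for each condition and applies a non-parametric significance test. The rating data, the 1-5 scale, and the choice of the Mann-Whitney U test are illustrative assumptions, not the paper's actual procedure or results.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical naturalness ratings (1-5 Likert scores) pooled across
# viewers and test sentences for each condition; purely illustrative.
ratings = {
    "real":          np.array([5, 4, 5, 4, 5, 4, 4, 5, 5, 4]),
    "text_driven":   np.array([4, 3, 4, 4, 3, 4, 3, 4, 4, 3]),
    "speech_driven": np.array([3, 2, 3, 3, 2, 3, 2, 3, 3, 2]),
}

def mean_opinion_score(scores: np.ndarray) -> float:
    """Average rating across all viewers and sentences for one condition."""
    return float(np.mean(scores))

for condition, scores in ratings.items():
    print(f"{condition}: MOS = {mean_opinion_score(scores):.2f}")

# Likert ratings are ordinal, so a non-parametric test such as the
# Mann-Whitney U test is one common way to check whether one condition
# is rated significantly higher than another.
stat, p = mannwhitneyu(ratings["text_driven"], ratings["speech_driven"],
                       alternative="greater")
print(f"text-driven > speech-driven: U = {stat:.1f}, p = {p:.4f}")
```

The same comparison would be run between each synthesiser and the re-rendered real speech benchmark, and between real speech with and without locally introduced errors, to quantify how much a small incorrect region degrades the overall rating.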
