The realization of an anthropomorphic agent that looks like a real human is an important research topic for broadening the range of human-to-human communication through computers. We have proposed a technique for synthesizing natural talking-face animation that enables such communication. How to evaluate the performance of talking-face animation, however, has remained an open issue. The performance of talking-face animation is determined by three factors: (1) Does it reproduce human speech movements well enough to permit lipreading? (2) Does it appear visually natural? (3) Is it accurately synchronized with the voice? In this paper, to examine Factor (1), we first presented talking-face animation together with voice to subjects and measured how well they understood the spoken words. Next, for Factor (2), the visual naturalness of the animation and the smoothness of the mouth motion were rated on a 5-point scale. Finally, for Factor (3), animation whose synchronization with the sound was offset by a fixed interval was shown to subjects to investigate the subjective perception of the synchronization gap, and the degree of the resulting unnaturalness was rated on a 5-point scale. In addition, the effect of the synchronization gap between voice and animation on how well the spoken words were understood was also evaluated. Through these evaluation experiments, the quality of the synthetic talking-face animation proposed by the authors was assessed, and we studied natural-appearing synchronization between synthetic talking-face animation and voice. © 2006 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 89(5): 39–52, 2006; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.20180
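The 5-point rating procedure described above is typically summarized as a mean opinion score (MOS) per condition. The following is a minimal sketch of that aggregation, assuming hypothetical offset values and ratings; the function name, data, and offsets are illustrative and not taken from the paper.

```python
from statistics import mean

# Hypothetical example data: each synchronization offset (in ms, negative
# meaning the animation leads the voice) maps to the 5-point scores given
# by individual subjects. These numbers are illustrative only.
ratings = {
    -120: [2, 3, 2, 2],  # animation leads voice by 120 ms
    0:    [5, 4, 5, 4],  # animation and voice synchronized
    120:  [3, 3, 4, 3],  # animation lags voice by 120 ms
}

def mos_per_offset(ratings):
    """Mean opinion score for each synchronization offset."""
    return {offset: mean(scores) for offset, scores in ratings.items()}

print(mos_per_offset(ratings))
```

Plotting MOS against the offset then shows how far the synchronization gap can grow before subjects perceive the animation as unnatural.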