Evaluation of formant-based lip motion generation in tele-operated humanoid robots

Generating natural motion in robots is important for improving human-robot interaction. We developed a tele-operation system in which the lip motion of a remote humanoid robot is automatically controlled from the operator's voice. In the present work, we introduce an improved version of our previously proposed speech-driven lip motion generation method, in which the degrees of lip height and width are estimated from vowel formant information. The method requires the calibration of only one parameter for speaker normalization. Lip height control is evaluated on two types of humanoid robots (Telenoid-R2 and Geminoid-F). Subjective evaluation indicated that the proposed audio-based method can generate lip motion whose naturalness is superior to that of vision-based and motion capture-based approaches. Partial lip width control was shown to further improve lip motion naturalness on Geminoid-F, which also has an actuator for stretching the lip corners. Issues regarding online real-time processing are also discussed.
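
The paper itself is not reproduced here, so the following Python sketch only illustrates the kind of pipeline the abstract describes: LPC-based formant estimation on a voiced speech frame, followed by a linear F1-to-lip-height and F2-to-lip-width mapping with a single speaker-normalization gain. The function names, the frequency ranges (roughly 250-850 Hz for F1 and 900-2500 Hz for F2), and the linear form of the mapping are assumptions for illustration, not the authors' actual method.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coeffs(frame, order=12):
    """Autocorrelation-method LPC (order ~ fs/1000 + 2 is a common rule of thumb)."""
    frame = frame * np.hamming(len(frame))              # taper frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))                  # A(z) = 1 - sum_k a_k z^{-k}

def first_two_formants(frame, fs, order=12):
    """Estimate F1 and F2 (Hz) from the angles of the LPC polynomial roots.
    Assumes a voiced frame; unvoiced frames should be skipped upstream."""
    roots = np.roots(lpc_coeffs(frame, order))
    roots = roots[np.imag(roots) > 0]                   # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    freqs = freqs[freqs > 90.0]                         # discard near-DC roots
    return freqs[0], freqs[1]

def lip_degrees(f1, f2, gain):
    """Hypothetical mapping: F1 ~ mouth openness, F2 ~ lip spreading.
    `gain` stands in for the single speaker-normalization parameter the
    abstract mentions; the paper's actual mapping may differ."""
    height = np.clip(gain * (f1 - 250.0) / 600.0, 0.0, 1.0)   # ~250-850 Hz -> 0..1
    width = np.clip((f2 - 900.0) / 1600.0, 0.0, 1.0)          # ~900-2500 Hz -> 0..1
    return height, width
```

In an online setting, frames of roughly 20-30 ms (e.g., 400 samples at 16 kHz) would be processed in a loop, with the resulting height and width values smoothed before being sent as actuator commands; `gain` would be calibrated once per operator, consistent with the one-parameter speaker normalization described above.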
