Detecting robot-directed speech by situated understanding in object manipulation tasks

In this paper, we propose a novel method for a robot to detect robot-directed speech, that is, to distinguish speech that users address to the robot from speech that users address to other people or to themselves. The originality of this work is the introduction of a multimodal semantic confidence (MSC) measure, which is used for domain classification of input speech based on whether the speech can be interpreted as a feasible action under the current physical situation in an object manipulation task. The measure is calculated by integrating speech, object, and motion confidence measures, with weights optimized by logistic regression. We then integrate this measure with gaze tracking and conduct experiments under conditions of natural human-robot interaction. Experimental results show that the proposed method achieves average recall and precision rates of 94% and 96%, respectively, for robot-directed speech detection.
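The abstract does not give the MSC formula, but its description, three confidence measures (speech, object, and motion) combined with weights learned by logistic regression, suggests a sketch like the one below. The toy training data, feature names, and decision threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row holds [speech_conf, object_conf, motion_conf]
# for one utterance; label 1 = robot-directed speech, 0 = not robot-directed.
X_train = np.array([
    [0.92, 0.85, 0.78],   # command interpretable as a feasible action
    [0.35, 0.20, 0.10],   # low confidences -> likely not robot-directed
    [0.80, 0.75, 0.66],
    [0.40, 0.30, 0.25],
])
y_train = np.array([1, 0, 1, 0])

# Logistic regression learns the weighting of the three confidences,
# mirroring the idea of fusing them into a single MSC-style score.
model = LogisticRegression()
model.fit(X_train, y_train)

def msc_score(speech_conf, object_conf, motion_conf):
    """Return the probability that an utterance is robot-directed."""
    features = np.array([[speech_conf, object_conf, motion_conf]])
    return model.predict_proba(features)[0, 1]

# Example: a new utterance whose interpretation is physically feasible
# yields a high score and would be treated as robot-directed.
print(msc_score(0.88, 0.81, 0.72))
```

In practice, the learned weights determine how much each modality contributes to the final decision, so a high speech-recognition confidence alone is not enough if the implied action is infeasible in the current scene.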
