Detecting Robot-Directed Speech by Situated Understanding in Physical Interaction

Summary: In this paper, we propose a novel method for a robot to detect robot-directed speech, that is, to distinguish speech that users address to a robot from speech that users address to other people or to themselves. The originality of this work is the introduction of a multimodal semantic confidence (MSC) measure, which is used for domain classification of input speech based on whether the speech can be interpreted as a feasible action under the current physical situation in an object manipulation task. The measure is calculated by integrating speech, object, and motion confidence scores with weightings optimized by logistic regression. We then combine this measure with gaze tracking and conduct experiments under natural human-robot interaction conditions. Experimental results show that the proposed method achieves average recall and precision rates of 94% and 96%, respectively, for robot-directed speech detection.
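To make the combination step concrete, here is a minimal sketch of how a logistic-regression fusion of the three confidence scores could look. The function name, weight values, and decision threshold below are illustrative assumptions, not the paper's implementation; in the proposed method the weights are learned from labeled data.

```python
import math

def multimodal_semantic_confidence(c_speech, c_object, c_motion,
                                   weights=(0.0, 1.0, 1.0, 1.0)):
    """Hypothetical MSC sketch: combine speech, object, and motion
    confidence scores via a logistic-regression model. The weights here
    are placeholders; the paper optimizes them on training data."""
    w0, w_s, w_o, w_m = weights
    z = w0 + w_s * c_speech + w_o * c_object + w_m * c_motion
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function

# An utterance is treated as robot-directed if the MSC exceeds a
# threshold (0.5 here is an illustrative assumption).
msc = multimodal_semantic_confidence(0.8, 0.7, 0.9)
is_robot_directed = msc > 0.5
```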
