Learning and Verification of Names with Multimodal User ID in Dialog

Acquiring new knowledge is a key functionality for humanoid robots. Envisioning a robot that can provide personalized services, the system needs to detect, recognize, and memorize information about specific persons. Recent work already shows promising results in speech recognition, voice identification, and face identification that enable a system to reliably detect and recognize persons, as well as approaches to interactively learn to know new persons in dialog, acquiring their names and ID information. One problem in this area is verification, namely detecting which persons are known and which are unknown; a second problem is the learning phase, namely learning the name of a person and storing it in a database with associated face and voice classifier information. This paper presents work to interactively acquire ID information, combining both of the above problems into one learning dialog. In dialog we combine multimodal input, including spoken name recognition, name pronunciation (phoneme recognition), name spelling (grapheme representation), face identification, and voice identification, and seek to build dialogs optimized to verify or learn a person's name and ID. For designing and training optimized dialogs we use a reinforcement learning approach and propose a multimodal simulation modeling the user's actions and the multimodal ID recognition components, including stochastic error models.

I. INTRODUCTION

In this paper we present work on learning names and person ID information in a multimodal dialog system for a humanoid robot. Among the dialogs that can be conducted with the robot are dialogs to identify persons and, in particular, to learn to know new persons. We have conducted experiments with a receptionist scenario, where one task of the robot receptionist was to identify the visiting person, or to learn the name of the person if unknown.
In the following we present our efforts on this task, namely isolated identification dialogs within the receptionist scenario. These dialogs fulfill two purposes: if the person is known, confirm the name of the person; if the person is unknown, classify the person as unknown and conduct a learning dialog to obtain the person's name.

The presented experiments make use of standard perceptual components available on a humanoid robot: visual perception with a stereo camera, and acoustic perception with distant and close-talk microphones. Visual perception provides face detection and identification. Acoustic perception provides voice identification and speech recognition, including name recognition, spelling, and phonetic understanding. These components provide recognition hypotheses which are interpreted by the dialog manager.

The challenge of this task is to define a dialog strategy, including when to confirm ID information and when to ask for name pronunciation or spelling. With the goal of optimizing dialogs with respect to success, length, and subjective measures, we have implemented a reinforcement learning approach which combines verification and learning into one dialog, integrating the multiple input modalities presented above. To achieve this goal, we first implemented a rule-based dialog strategy, and later a reinforcement learning strategy, which was trained in a multimodal user simulation.

In the following we present the setup for multimodal integration in dialog, the definition of the handcrafted strategy, and the learning of dialog strategies in the multimodal user simulation. Both dialog strategies are evaluated within the simulation and compared against each other. First results from a real user experiment are also reported.
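To make the idea of training a verification-and-learning dialog strategy against a simulated user concrete, the following is a minimal, self-contained sketch. All states, actions, rewards, and error rates here are hypothetical illustrations, not values from our system: a tabular learner picks dialog actions (confirm the name, ask for the name, or close the dialog as "known" or "new") against a simulated user whose face/voice ID hypothesis and yes/no answers pass through a stochastic error model.

```python
import random
from collections import defaultdict

# Hypothetical action set: confirm ("Are you X?"), ask for the name
# (spelling/pronunciation), or close the dialog with a decision.
ACTIONS = ["confirm", "ask_name", "close_known", "close_new"]

def run_episode(Q, eps=0.1, id_err=0.25, asr_err=0.05, rng=random):
    """Roll out one simulated dialog; returns a list of (state, action, reward)."""
    known = rng.random() < 0.5                              # ground-truth identity
    id_hyp = known if rng.random() > id_err else not known  # noisy face/voice ID
    state = (id_hyp, "none")                                # (ID hypothesis, confirm status)
    steps = []
    for _ in range(6):                                      # cap dialog length
        if rng.random() < eps:                              # eps-greedy exploration
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(state, x)])
        r, done, nxt = -1.0, False, state                   # per-turn cost
        if a == "confirm":                                  # noisy yes/no answer
            ans = known if rng.random() > asr_err else not known
            nxt = (id_hyp, "yes" if ans else "no")
        elif a == "ask_name":                               # noisy name hypothesis
            ans = known if rng.random() > asr_err else not known
            nxt = (ans, state[1])
        elif a == "close_known":                            # greet user as known
            r += 10.0 if known else -10.0
            done = True
        else:                                               # close_new: start enrollment
            r += 10.0 if not known else -10.0
            done = True
        steps.append((state, a, r))
        state = nxt
        if done:
            break
    return steps

def train(episodes=20000, alpha=0.1, seed=1):
    """Every-visit Monte-Carlo control with an eps-greedy behavior policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        steps = run_episode(Q, rng=rng)
        G = 0.0
        for state, a, r in reversed(steps):                 # returns-to-go, no discount
            G += r
            Q[(state, a)] += alpha * (G - Q[(state, a)])
    return Q
```

Under these assumed error rates, the learned greedy policy tends to ask one clarifying question before closing, trading an extra turn for a more reliable decision; balancing exactly this kind of cost against success probability is what the reinforcement learning strategy in our system is optimized for, albeit over a much richer multimodal state.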
