Low-level grounding in a multimodal mobile service robot conversational system using graphical models

The main task of a service robot with a voice-enabled communication interface is to engage a user in dialogue, providing access to the services it is designed for. In managing such interaction, the key issue is inferring the user goal (intention) from the request for a service at each dialogue turn. Under service robot deployment conditions, the limitations of speech recognition with noisy speech input and inexperienced users may jeopardize user goal identification. In this paper, we introduce a grounding state-based model motivated by the need to reduce the risk of communication failure due to incorrect user goal identification. The model exploits the multiple modalities available in the service robot system to provide evidence for reaching grounding states. For the speech input to be treated as sufficiently grounded (correctly understood) by the robot, four proposed grounding states have to be reached. Bayesian networks that combine speech and non-speech modalities during user goal identification are used to estimate the probability that each grounding state has been reached. These probabilities serve as a basis for detecting whether the user is attending to the conversation, as well as for deciding on an alternative input modality (e.g., buttons) when the speech modality is unreliable. The Bayesian networks used in the grounding model are specially designed for modularity and computationally efficient inference. The potential of the proposed model is demonstrated by comparing a conversational system for the mobile service robot RoboX that employs only speech recognition for user goal identification with a system equipped with multimodal grounding. The evaluation experiments use component-level and system-level metrics for technical (objective) and user-based (subjective) evaluation, with multimodal data collected during conversations of the robot RoboX with users.
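
To make the idea of estimating the probability of a grounding state from several modalities concrete, the following is a minimal sketch and not the networks used in the paper. It assumes a single binary state (here, hypothetically, "user is attending"), two binary evidence sources that are conditionally independent given that state, and made-up probabilities. The names `posterior_state_reached`, `choose_modality`, `asr_confident`, `face_detected`, and the 0.5 threshold are all illustrative assumptions.

```python
# Hypothetical sketch: naive-Bayes style fusion of binary evidence from a
# speech modality and a non-speech modality for ONE binary grounding state.
# All numbers and names are illustrative assumptions, not values from the paper.

def posterior_state_reached(prior, likelihoods, evidence):
    """Return P(state reached | evidence).

    prior       -- P(state reached) before seeing evidence
    likelihoods -- name -> (P(obs=True | reached), P(obs=True | not reached))
    evidence    -- name -> observed boolean value
    Evidence variables are assumed conditionally independent given the state.
    """
    p_reached = prior
    p_not_reached = 1.0 - prior
    for name, observed in evidence.items():
        p_true_reached, p_true_not = likelihoods[name]
        p_reached *= p_true_reached if observed else 1.0 - p_true_reached
        p_not_reached *= p_true_not if observed else 1.0 - p_true_not
    # Normalize over the two possible values of the state.
    return p_reached / (p_reached + p_not_reached)


def choose_modality(posterior, threshold=0.5):
    """Fall back to buttons when the state is unlikely to have been reached."""
    return "speech" if posterior >= threshold else "buttons"


# Assumed conditional probabilities for the two evidence sources.
LIKELIHOODS = {
    "asr_confident": (0.8, 0.3),   # recognizer reports high confidence
    "face_detected": (0.9, 0.2),   # a face is seen in front of the robot
}

p_both = posterior_state_reached(
    0.5, LIKELIHOODS, {"asr_confident": True, "face_detected": True})
p_none = posterior_state_reached(
    0.5, LIKELIHOODS, {"asr_confident": False, "face_detected": False})

print(round(p_both, 3), choose_modality(p_both))
print(round(p_none, 3), choose_modality(p_none))
```

With these assumed numbers, agreeing evidence from both modalities pushes the posterior well above the threshold and the sketch keeps speech as the input modality, while missing evidence from both pushes it well below and the sketch switches to buttons. A full Bayesian network would replace the independence assumption with an explicit graph structure.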
