Multimodal Uncertainty Reduction for Intention Recognition in Human-Robot Interaction

Assistive robots can improve the quality of life and personal independence of elderly people by supporting everyday activities. To guarantee safe and intuitive interaction between human and robot, human intentions need to be recognized automatically. Since humans communicate their intentions multimodally, using multiple modalities for intention recognition not only increases robustness against the failure of individual modalities but, above all, reduces the uncertainty about the intention to be recognized. This is desirable because, particularly in direct interaction between robots and potentially vulnerable humans, both minimal uncertainty about the situation and knowledge of the remaining uncertainty are necessary. In contrast to existing methods, this work therefore introduces a new approach to multimodal intention recognition that focuses on uncertainty reduction through classifier fusion. For the four considered modalities, speech, gestures, gaze direction, and scene objects, individual intention classifiers are trained, each of which outputs a probability distribution over all possible intentions. Combining these output distributions with the Bayesian method Independent Opinion Pool [1] decreases the uncertainty about the intention to be recognized. The approach is evaluated in a collaborative human-robot interaction task with a 7-DoF robot arm. The results show that the fused classifiers, which combine multiple modalities, outperform the respective individual base classifiers in terms of accuracy, robustness, and uncertainty reduction.
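
As a minimal sketch of the fusion step (not the authors' implementation): assuming each per-modality classifier returns a normalized probability distribution over the same set of intentions, the Independent Opinion Pool reduces to a normalized element-wise product of these distributions. The classifier outputs and intention labels below are hypothetical and only illustrate how the fused distribution can become less uncertain (lower entropy) than any individual one.

```python
import numpy as np

def independent_opinion_pool(distributions, eps=1e-12):
    """Fuse per-modality intention distributions via Independent Opinion Pool.

    distributions: iterable of 1-D arrays, each a normalized probability
    distribution over the same set of intentions (one per modality).
    Returns the normalized element-wise product of the inputs.
    """
    fused = np.ones(len(distributions[0]))
    for p in distributions:
        fused *= np.clip(np.asarray(p, dtype=float), eps, 1.0)  # avoid hard zeros
    return fused / fused.sum()

def entropy(p):
    """Shannon entropy in bits, used here as an uncertainty measure."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log2(np.clip(p, 1e-12, 1.0))).sum())

# Hypothetical outputs of four base classifiers (speech, gesture, gaze, scene
# objects) over three possible intentions, e.g. "hand over", "hold", "release".
speech  = np.array([0.50, 0.30, 0.20])
gesture = np.array([0.45, 0.40, 0.15])
gaze    = np.array([0.60, 0.25, 0.15])
objects = np.array([0.40, 0.35, 0.25])

fused = independent_opinion_pool([speech, gesture, gaze, objects])
print("fused distribution :", np.round(fused, 3))
print("entropy speech only:", round(entropy(speech), 3), "bits")
print("entropy fused      :", round(entropy(fused), 3), "bits")
```

With these example numbers the fused distribution concentrates on the intention that all modalities weakly favor, and its entropy drops well below that of any single classifier, which is the uncertainty-reduction effect the abstract describes.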

[1] J. Berger. Statistical Decision Theory and Bayesian Analysis, 1988.

[2] Chalapathy Neti et al. Stream confidence estimation for audio-visual speech recognition. INTERSPEECH, 2000.

[3] M. Ernst et al. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 2002.

[4] Dana Kulic et al. Estimating intent for human-robot interaction, 2003.

[5] Alexander H. Waibel et al. Natural human-robot interaction using speech, head pose and gestures. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2004.

[6] Libor Preucil et al. Robust data fusion with occupancy grid. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2005.

[7] M. Hayhoe et al. Look-ahead fixations: anticipatory eye movements in natural tasks. Experimental Brain Research, 2007.

[8] Mary M. Hayhoe et al. Task and context determine where you look. Journal of Vision, 2016.

[9] Uwe D. Hanebeck et al. Tractable probabilistic models for intention recognition based on expert knowledge. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007.

[10] Derek Elsaesser. Sensor data fusion using a probability density grid. 10th International Conference on Information Fusion, 2007.

[11] Jean-Philippe Thiran et al. Using entropy as a stream reliability estimate for audio-visual speech recognition. 16th European Signal Processing Conference, 2008.

[12] Gerhard Rigoll et al. A Multimodal Human-Robot-Interaction Scenario: Working Together with an Industrial Robot. HCI, 2009.

[13] Lei Shi et al. Multi-class classification for semantic labeling of places. 11th International Conference on Control Automation Robotics & Vision, 2010.

[14] Hao Su et al. Objects as Attributes for Scene Classification. ECCV Workshops, 2010.

[15] Richard Kelley et al. Context-Based Bayesian Intent Recognition. IEEE Transactions on Autonomous Mental Development, 2012.

[16] Monica N. Nicolescu et al. Deep networks for predicting human intent with respect to objects. 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2012.

[17] Jan Peters et al. Probabilistic Movement Primitives. NIPS, 2013.

[18] Hong Liu et al. Audio-visual keyword spotting based on adaptive decision fusion under noisy conditions for human-robot interaction. IEEE International Conference on Robotics and Automation (ICRA), 2014.

[19] Patric Bach et al. The affordance-matching hypothesis: how objects guide action understanding and prediction. Frontiers in Human Neuroscience, 2014.

[20] Bilge Mutlu et al. Using gaze patterns to predict task intent in collaboration. Frontiers in Psychology, 2015.

[21] Minho Lee et al. Human intention understanding based on object affordance and action classification. International Joint Conference on Neural Networks (IJCNN), 2015.

[22] Jan Peters et al. Learning multiple collaborative tasks with a mixture of Interaction Primitives. IEEE International Conference on Robotics and Automation (ICRA), 2015.

[23] Tara N. Sainath et al. Convolutional neural networks for small-footprint keyword spotting. INTERSPEECH, 2015.

[24] Sergio Escalera et al. Gesture based human multi-robot interaction. International Joint Conference on Neural Networks (IJCNN), 2015.

[25] Jian Huang et al. Multi-sensor based human motion intention recognition algorithm for walking-aid robot. IEEE International Conference on Robotics and Biomimetics (ROBIO), 2015.

[26] Siddhartha S. Srinivasa et al. Predicting User Intent Through Eye Gaze for Shared Autonomy. AAAI Fall Symposia, 2016.

[27] Teresa Zielinska et al. Predicting the Intention of Human Activities for Real-Time Human-Robot Interaction (HRI). ICSR, 2016.

[28] Petros Maragos et al. Multimodal human action recognition in assistive human-robot interaction. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[29] Shuzhi Sam Ge et al. Intelligent speech control system for human-robot interaction. 35th Chinese Control Conference (CCC), 2016.

[30] Jinhua Xu et al. Object-Based Representation for Scene Classification. Canadian Conference on AI, 2016.

[31] Wafa Johal et al. Starting engagement detection towards a companion robot using multimodal features. Robotics and Autonomous Systems, 2015.

[32] Frédéric Lerasle et al. A multi-modal perception based assistive robotic system for the elderly. Computer Vision and Image Understanding, 2016.

[33] Jimmy J. Lin et al. Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting. arXiv, 2017.

[34] Tiana Rakotovao Andriamahefa. Integer Occupancy Grids: a probabilistic multi-sensor fusion framework for embedded perception, 2017.

[35] Petros Maragos et al. Multimodal Signal Processing and Learning Aspects of Human-Robot Interaction for an Assistive Bathing Robot. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[36] Teh Ying Wah et al. Data fusion and multiple classifier systems for human activity detection and health monitoring: Review and open research directions. Information Fusion, 2019.

[37] Teresa Zielinska et al. Predicting Human Actions Taking into Account Object Affordances. Journal of Intelligent & Robotic Systems, 2018.