Exploring Temporal Dependencies in Multimodal Referring Expressions with Mixed Reality

In collaborative tasks, people rely on both verbal and non-verbal cues simultaneously to communicate with each other. For human-robot interaction to proceed smoothly and naturally, a robot should be able to robustly disambiguate referring expressions. In this work, we propose a model that disambiguates multimodal fetching requests using modalities such as head movements, hand gestures, and speech. We analysed the data acquired from mixed-reality experiments and formulated the hypothesis that modelling the temporal dependencies of events across these three modalities increases the model's predictive power. We evaluated our model within a Bayesian framework for interpreting referring expressions, with and without exploiting a temporal prior.
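
As a rough illustration of the kind of Bayesian fusion described above (a minimal sketch, not the paper's implementation), the example below combines per-modality likelihoods over candidate referents under a conditional-independence assumption, once with a uniform prior and once with a hypothetical temporal weighting that favours cues occurring closer in time to the verbal reference. The object names, likelihood values, and decay constant are all illustrative assumptions.

```python
import numpy as np

candidates = ["red_block", "blue_block", "green_block"]  # hypothetical referents

# Per-modality likelihoods p(observation | referent), e.g. from a speech model,
# a head-pose attention model, and a pointing-gesture model (values assumed).
p_speech  = np.array([0.70, 0.20, 0.10])
p_head    = np.array([0.50, 0.30, 0.20])
p_gesture = np.array([0.60, 0.25, 0.15])

def fuse(likelihoods, prior):
    """Multiply modality likelihoods with a prior, assuming the modalities are
    conditionally independent given the referent, then normalise."""
    posterior = prior.copy()
    for lik in likelihoods:
        posterior = posterior * lik
    return posterior / posterior.sum()

# Without a temporal prior: start from a uniform belief over candidates.
uniform_prior = np.full(len(candidates), 1.0 / len(candidates))
print("no temporal prior:", fuse([p_speech, p_head, p_gesture], uniform_prior))

# With a temporal prior: down-weight modalities whose events occurred further
# from the spoken reference (hypothetical exponential-decay weighting).
time_offsets = {"speech": 0.0, "head": 0.8, "gesture": 0.3}  # seconds from utterance
decay = 0.5
weights = {m: np.exp(-decay * dt) for m, dt in time_offsets.items()}

weighted = [p_speech  ** weights["speech"],
            p_head    ** weights["head"],
            p_gesture ** weights["gesture"]]
print("with temporal prior:", fuse(weighted, uniform_prior))
```

In this sketch the temporal prior enters as an exponent on each modality's likelihood, so cues far from the utterance contribute less evidence; the actual model may integrate temporal dependencies differently.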
