Exploring Temporal Dependencies in Multimodal Referring Expressions with Mixed Reality
Elena Sibirtseva | Ali Ghadirzadeh | Iolanda Leite | Mårten Björkman | Danica Kragic
[1] Katarzyna Harezlak, et al. Towards Accurate Eye Tracker Calibration - Methods and Procedures, 2014, KES.
[2] Philippe A. Palanque, et al. Fusion engines for multimodal input: a survey, 2009, ICMI-MLMI '09.
[3] Wolfram Burgard, et al. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents), 2005.
[4] Yuichiro Yoshikawa, et al. Robot gains social intelligence through multimodal deep reinforcement learning, 2016, 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids).
[5] Guy Hoffman, et al. Computational Human-Robot Interaction, 2016, Found. Trends Robotics.
[6] David Whitney, et al. Interpreting multimodal referring expressions in real time, 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).
[7] Janet Beavin Bavelas, et al. Hand and Facial Gestures in Conversational Interaction, 2014.
[8] Douglas A. Reynolds, et al. Speaker Verification Using Adapted Gaussian Mixture Models, 2000, Digit. Signal Process.
[9] Changsong Liu, et al. Collaborative Effort towards Common Ground in Situated Human-Robot Dialogue, 2014, 2014 9th ACM/IEEE International Conference on Human-Robot Interaction (HRI).
[10] Dimosthenis Kontogiorgos, et al. Multimodal Reference Resolution In Collaborative Assembly Tasks, 2018, MA3HMI@ICMI.
[11] Subhashini Venugopalan, et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks, 2014, NAACL.
[12] Radu Horaud, et al. Deep Reinforcement Learning for Audio-Visual Gaze Control, 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[13] Arman Savran, et al. Temporal Bayesian Fusion for Affect Sensing: Combining Video, Audio, and Lexical Modalities, 2015, IEEE Transactions on Cybernetics.
[14] Matthew Turk, et al. Multimodal interaction: A review, 2014, Pattern Recognit. Lett.
[15] Moreno I. Coco, et al. Action Anticipation: Reading the Intentions of Humans and Robots, 2018, IEEE Robotics and Automation Letters.
[16] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[17] Ville Kyrki, et al. Probabilistic Mapping of Human Visual Attention from Head Pose Estimation, 2017, Front. Robot. AI.
[18] Bernhard Schölkopf, et al. A tutorial on support vector regression, 2004, Stat. Comput.
[19] Patrick Gebhard, et al. Exploring a Model of Gaze for Grounding in Multimodal HRI, 2014, ICMI.
[20] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[21] Bilge Mutlu, et al. Using gaze patterns to predict task intent in collaboration, 2015, Front. Psychol.
[22] Christopher Joseph Pal, et al. Describing Videos by Exploiting Temporal Structure, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[23] Nitish Srivastava, et al. Learning Representations for Multimodal Data with Deep Belief Nets, 2012.
[24] Takenobu Tokunaga, et al. A Unified Probabilistic Approach to Referring Expressions, 2012, SIGDIAL Conference.
[25] Danica Kragic, et al. A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction, 2018, 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN).
[26] Richard A. Bolt, et al. "Put-that-there": Voice and gesture at the graphics interface, 1980, SIGGRAPH '80.
[27] Siddhartha S. Srinivasa, et al. Predicting User Intent Through Eye Gaze for Shared Autonomy, 2016, AAAI Fall Symposia.