Learning to Generate Unambiguous Spatial Referring Expressions for Real-World Environments

Referring to objects in a natural and unambiguous manner is crucial for effective human-robot interaction. Previous research on learning-based referring expressions has focused primarily on comprehension tasks, while the generation of referring expressions is still mostly limited to rule-based methods. In this work, we propose a two-stage approach that uses deep learning to estimate spatial relations in order to describe an object naturally and unambiguously with a referring expression. We compare our method to the state-of-the-art algorithm in ambiguous environments (e.g., environments that include very similar objects with similar relationships). We show that our method generates referring expressions that people find more accurate (~30% better) and would prefer to use (~32% more often).
