Visual Referring Expression Recognition: What Do Systems Actually Learn?

We present an empirical analysis of state-of-the-art systems for referring expression recognition – the task of identifying the object in an image referred to by a natural language expression – with the goal of gaining insight into how these systems reason about language and vision. Surprisingly, we find strong evidence that even sophisticated and linguistically-motivated models for this task may ignore linguistic structure, instead relying on shallow correlations introduced by unintended biases in the data selection and annotation process. For example, we show that a system trained and tested on the input image without the input referring expression can achieve a precision of 71.2% in top-2 predictions. Furthermore, a system that predicts only the object category given the input expression can achieve a precision of 84.2% in top-2 predictions. These surprisingly positive results for what should be deficient prediction scenarios suggest that careful analysis of what our models are learning – and further, how our data is constructed – is critical as we seek to make substantive progress on grounded language tasks.
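To make the first deficient scenario concrete, here is a minimal sketch of an "expression-ablated" baseline: a model that scores each candidate box from visual features alone, never seeing the referring expression, evaluated by top-2 precision (is the ground-truth box among the two highest-scoring candidates?). All names here (`score_boxes_image_only`, `top_k_precision`) and the toy linear scorer are illustrative assumptions, not the actual systems analyzed in the paper.

```python
# Hypothetical sketch of the image-only ablation baseline and the
# top-2 precision metric described in the abstract.
import numpy as np

rng = np.random.default_rng(0)

def score_boxes_image_only(box_features: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Assign each candidate box a score from visual features only
    (no referring expression is ever seen by this scorer)."""
    return box_features @ w  # linear scorer; stand-in for a trained visual head

def top_k_precision(scores: np.ndarray, gold_index: int, k: int = 2) -> float:
    """1.0 if the gold box is among the k highest-scoring candidates, else 0.0."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(gold_index in top_k)

# Toy evaluation loop: each "image" has 8 candidate boxes with 16-d features.
w = rng.normal(size=16)
hits = []
for _ in range(100):
    feats = rng.normal(size=(8, 16))   # per-box visual features
    gold = int(rng.integers(0, 8))     # index of the referred object
    hits.append(top_k_precision(score_boxes_image_only(feats, w), gold))
print(f"top-2 precision: {np.mean(hits):.3f}")  # ~0.25 by chance with 8 boxes
```

With random features this baseline sits near the 2/8 = 25% chance level; the paper's point is that on real data such an expression-blind system reaches 71.2%, revealing how much of the task is solvable from dataset biases alone.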
