Linguistic issues behind visual question answering

Abstract

Answering a question grounded in an image is a crucial ability that requires understanding the question, the visual context, and their interaction at many linguistic levels: semantics, syntax and pragmatics, among others. As such, visually-grounded questions have long been of interest to theoretical linguists and cognitive scientists. Moreover, they inspired the first attempts to computationally model natural language understanding, where pioneering systems were faced with the highly challenging, and still unsolved, task of jointly dealing with syntax, semantics and inference whilst understanding a visual context. Boosted by impressive advancements in machine learning, the task of answering visually-grounded questions has experienced renewed interest in recent years, to the point of becoming a research sub-field at the intersection of computational linguistics and computer vision. In this paper, we review current approaches to the problem, which encompass the development of datasets, models and frameworks. We conduct our investigation from the perspective of theoretical linguistics: from pioneering computational linguistic work we extract a list of desiderata, which we then use to review current computational achievements. We acknowledge that impressive progress has been made in reconciling the engineering view with the theoretical one. At the same time, we argue that further research is needed to arrive at a unified approach that jointly encompasses all the underlying linguistic problems. We conclude the paper by sharing our own desiderata for the future.
