Answering Questions about Data Visualizations using Efficient Bimodal Fusion

Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g., bar charts, pie charts, and line graphs. CQA requires capabilities that natural-image VQA algorithms lack: fine-grained measurements, optical character recognition, and handling out-of-vocabulary words in both questions and answers. Without modifications, state-of-the-art VQA algorithms perform poorly on this task. Here, we propose a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL). PReFIL first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question. Despite its simplicity, PReFIL greatly surpasses state-of-the-art systems and human baselines on both the FigureQA and DVQA datasets. Additionally, we demonstrate that PReFIL can be used to reconstruct tables by asking a series of questions about a chart.
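To make the fusion-then-aggregation idea concrete, the following PyTorch sketch tiles a recurrent question embedding over every location of a convolutional image-feature grid, mixes the two modalities with 1x1 convolutions, and aggregates the fused features with a recurrent layer before answer classification. It is a minimal, single-branch simplification for exposition: the class names, layer sizes, and choice of encoders are assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    """Tile the question vector over every spatial location of the image
    feature grid and mix the two modalities with 1x1 convolutions."""
    def __init__(self, img_channels, q_dim, fused_channels=256):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(img_channels + q_dim, fused_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused_channels, fused_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, img_feats, q_vec):
        # img_feats: (B, C, H, W) from a CNN backbone; q_vec: (B, q_dim)
        b, _, h, w = img_feats.shape
        q_tiled = q_vec[:, :, None, None].expand(b, q_vec.size(1), h, w)
        return self.mix(torch.cat([img_feats, q_tiled], dim=1))

class PReFILSketch(nn.Module):
    """Question encoder -> bimodal fusion -> recurrent aggregation -> answer logits."""
    def __init__(self, vocab_size, num_answers, img_channels=512, q_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.q_rnn = nn.LSTM(64, q_dim, batch_first=True)
        self.fusion = BimodalFusion(img_channels, q_dim)
        self.aggregate = nn.GRU(256, 256, batch_first=True, bidirectional=True)
        self.classify = nn.Sequential(
            nn.Linear(2 * 256, 512), nn.ReLU(inplace=True), nn.Linear(512, num_answers))

    def forward(self, img_feats, question_tokens):
        _, (h_n, _) = self.q_rnn(self.embed(question_tokens))
        q_vec = h_n[-1]                          # (B, q_dim) final question state
        fused = self.fusion(img_feats, q_vec)    # (B, 256, H, W) bimodal features
        seq = fused.flatten(2).transpose(1, 2)   # (B, H*W, 256) treated as a sequence
        _, h_agg = self.aggregate(seq)           # (2, B, 256) forward/backward states
        pooled = torch.cat([h_agg[0], h_agg[1]], dim=1)
        return self.classify(pooled)             # answers treated as classes

A training loop would pair this with a cross-entropy loss over a fixed answer vocabulary; the out-of-vocabulary answer strings mentioned in the abstract require additional machinery that this classification-only sketch does not address.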
