Generating Accurate Caption Units for Figure Captioning

Scientific-style figures are commonly used on the web to present numerical information. Captions that convey accurate figure information and sound natural would significantly improve figure accessibility. In this paper, we present promising results on machine figure captioning. A recent corpus analysis of real-world captions reveals that machine figure captioning systems should start by generating accurate caption units. We formulate caption unit generation as a controlled captioning problem: given a caption unit type as a control signal, a model generates an accurate caption unit of that type. As a proof of concept on single bar charts, we propose a model, FigJAM, that achieves this goal by utilizing metadata information and a joint static and dynamic dictionary. Quantitative evaluations on two datasets from the figure question answering task show that our model generates more accurate caption units than competitive baseline models. A user study with ten human experts confirms the standalone accuracy and naturalness of the machine-generated caption units. Finally, a post-editing simulation study demonstrates the potential for models to paraphrase and stitch single-type caption units together into multi-type captions by learning from data.
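
The abstract's "joint static and dynamic dictionary" is reminiscent of pointer-generator-style copy mechanisms, and the "control signal" suggests conditioning decoding on an embedded caption unit type. The sketch below is a minimal, hypothetical reading of those two ideas, not the authors' implementation: all class names, shapes, and the GRU-based decoder step are illustrative assumptions. At each step, a distribution over a fixed caption vocabulary (the static dictionary) is mixed with a copy distribution over figure metadata tokens (the dynamic dictionary), so chart-specific labels such as bar names can be emitted verbatim even when they are out of vocabulary.

```python
# Hypothetical sketch of a controlled caption-unit decoder step that mixes a
# static vocabulary distribution with a dynamic copy distribution over figure
# metadata tokens. Names, dimensions, and architecture are assumptions made
# for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDictionaryDecoderStep(nn.Module):
    def __init__(self, hidden_size, static_vocab_size, num_unit_types):
        super().__init__()
        # Control signal: an embedding of the desired caption unit type.
        self.type_embed = nn.Embedding(num_unit_types, hidden_size)
        self.cell = nn.GRUCell(2 * hidden_size, hidden_size)
        # Static dictionary: projection onto a fixed caption vocabulary.
        self.static_out = nn.Linear(hidden_size, static_vocab_size)
        # Dynamic dictionary: scores metadata tokens for copying.
        self.copy_score = nn.Linear(hidden_size, hidden_size)
        # Soft switch between generating and copying.
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, prev_emb, state, unit_type, meta_states,
                meta_token_ids, extended_vocab_size):
        # Condition the decoder input on the caption-unit-type control signal.
        ctrl = self.type_embed(unit_type)                      # (B, H)
        state = self.cell(torch.cat([prev_emb, ctrl], -1), state)

        # Static dictionary: distribution over the fixed vocabulary.
        p_gen = F.softmax(self.static_out(state), dim=-1)      # (B, V_static)

        # Dynamic dictionary: attention-style scores over encoded metadata
        # tokens (e.g., axis titles, bar labels), scattered into an extended
        # vocabulary so out-of-vocabulary labels can be copied verbatim.
        scores = torch.einsum('bh,bmh->bm',
                              self.copy_score(state), meta_states)
        p_copy_meta = F.softmax(scores, dim=-1)                # (B, M)
        p_copy = torch.zeros(state.size(0), extended_vocab_size,
                             device=state.device)
        p_copy.scatter_add_(1, meta_token_ids, p_copy_meta)

        # Pointer-generator-style mixture of the two dictionaries.
        g = torch.sigmoid(self.gate(state))                    # (B, 1)
        p_gen_ext = F.pad(p_gen, (0, extended_vocab_size - p_gen.size(1)))
        return g * p_gen_ext + (1 - g) * p_copy, state
```

Under this reading, swapping the `unit_type` index at inference time would steer the same decoder toward different caption unit types, while the copy gate lets the model ground numbers and labels in the chart's metadata rather than hallucinating them from the static vocabulary.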
