SciCap: Generating Captions for Scientific Figures

Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing – including figure-type classification, sub-figure identification, text normalization, and caption text selection – SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.
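To make the pre-processing steps above concrete, here is a minimal sketch of what caption text normalization and caption text selection might look like. This is an illustrative assumption, not the authors' actual pipeline: the function names, the placeholder tokens, and the length-based selection rule are all hypothetical.

```python
import re

def normalize_caption(caption: str) -> str:
    """Hypothetical text normalization: collapse whitespace, map
    bracketed citations and numbers to placeholder tokens, lowercase.
    The paper's exact normalization rules may differ."""
    text = caption.strip()
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    text = re.sub(r"\[[^\]]*\]", "[REF]", text)   # bracketed citations -> [REF]
    text = re.sub(r"\d+(\.\d+)?", "[NUM]", text)  # numbers -> [NUM]
    return text.lower()

def select_caption(caption: str, max_tokens: int = 100) -> bool:
    """Hypothetical caption text selection: keep a caption only if its
    first sentence is at most max_tokens whitespace-delimited tokens."""
    first_sentence = caption.split(". ")[0]
    return len(first_sentence.split()) <= max_tokens
```

A caption such as "Fig.  3: Accuracy [5]" would normalize to "fig. [num]: accuracy [ref]" under these assumed rules, giving the captioning model a smaller, more regular vocabulary to learn from.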
