SciCap: Generating Captions for Scientific Figures

Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing – including figure-type classification, sub-figure identification, text normalization, and caption text selection – SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.
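To make the pre-processing steps above concrete, here is a minimal sketch of what caption text normalization and caption text selection might look like. This is an illustrative assumption, not the authors' actual pipeline: the function names, the placeholder tokens, and the length-based selection rule are all hypothetical.

```python
import re

def normalize_caption(caption: str) -> str:
    """Hypothetical text normalization: collapse whitespace, map
    bracketed citations and numbers to placeholder tokens, lowercase.
    The paper's exact normalization rules may differ."""
    text = caption.strip()
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    text = re.sub(r"\[[^\]]*\]", "[REF]", text)   # bracketed citations -> [REF]
    text = re.sub(r"\d+(\.\d+)?", "[NUM]", text)  # numbers -> [NUM]
    return text.lower()

def select_caption(caption: str, max_tokens: int = 100) -> bool:
    """Hypothetical caption text selection: keep a caption only if its
    first sentence is at most max_tokens whitespace-delimited tokens."""
    first_sentence = caption.split(". ")[0]
    return len(first_sentence.split()) <= max_tokens
```

A caption such as "Fig.  3: Accuracy [5]" would normalize to "fig. [num]: accuracy [ref]" under these assumed rules, giving the captioning model a smaller, more regular vocabulary to learn from.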
