PDFFigures 2.0: Mining figures from research papers

Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications, we develop PDFFigures 2.0, an algorithm that extracts figures, tables, and captions from documents. Our approach analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of body text, and then locates figures and tables by reasoning about the empty regions within that text. To evaluate our work, we introduce a new dataset of computer science papers, along with ground truth labels for the locations of the figures, tables, and captions within them. Our algorithm achieves 94% precision at 90% recall on this dataset, surpassing the previous state of the art. Further, we show how our framework was used to extract figures from a corpus of over one million papers, and how the resulting extractions were integrated into the user interface of a smart academic search engine, Semantic Scholar (www.semanticscholar.org). Finally, we present the results of an exploratory data analysis of the extracted figures, as well as an extension of our method to the task of section title extraction. We release our dataset and code on our project webpage to enable future research (http://pdffigures2.allenai.org).
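As an illustration of how precision and recall can be computed for extracted regions, the sketch below scores predicted bounding boxes against ground truth boxes on a single page. It is a minimal example under stated assumptions, not the evaluation code used for the reported numbers: the greedy one-to-one matching, the intersection-over-union criterion, the 0.8 threshold, and the names `iou` and `precision_recall` are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box],
    truth: List[Box],
    threshold: float = 0.8,  # assumed overlap threshold, not taken from the paper
) -> Tuple[float, float]:
    """Greedily match each prediction to at most one unused ground truth box."""
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        # Pick the remaining ground truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    # By convention here, an empty list of predictions (or truths) scores 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


if __name__ == "__main__":
    truth_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
    predicted_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
    print(precision_recall(predicted_boxes, truth_boxes))
```

In this toy page, one of the two predictions matches a ground truth box, and one of the two ground truth boxes is found, so both scores are one half. A dataset-level score would sum true positives, predictions, and ground truth boxes over all pages before dividing.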
