Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

Identifying and extracting figures and tables, along with their captions, from scholarly articles is important both for building article-summarization tools and as part of larger systems that seek a deeper, semantic understanding of these articles. While many off-the-shelf tools, such as PDFBox and Poppler, can extract embedded images from these documents, they cannot extract tables, captions, or figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas where figures or tables could reside by reasoning about the empty regions within that text. Because it makes no strong assumptions about the format of the figures embedded in the document, this method can extract a wide variety of figures, as long as they can be differentiated from the main article's text. Our algorithm also includes a caption-to-figure matching component that is effective even when an individual caption is adjacent to multiple figures. We further contribute methods that leverage consistency and formatting assumptions to identify titles, body text, and captions within each article. We introduce a new dataset of 150 computer science papers, with ground-truth labels for the locations of the figures, tables, and captions within them. Our algorithm achieves 96% precision at 92% recall on this dataset, surpassing the previous state of the art. We release our dataset, code, and evaluation scripts on our project website to enable future research.
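The abstract does not specify how captions are paired with figure regions, but the key difficulty it names (a caption sitting adjacent to several figures) suggests a global assignment rather than a per-caption nearest-neighbor choice. Below is a minimal, hypothetical sketch of such a matcher, assuming axis-aligned bounding boxes `(x0, y0, x1, y1)`, a Manhattan distance between box centers, and a brute-force search over assignments that minimizes the total caption-to-region distance; the function names and representation are illustrative, not the paper's actual implementation.

```python
from itertools import permutations

def center(box):
    """Center point of an axis-aligned bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def distance(a, b):
    """Manhattan distance between the centers of two boxes."""
    (ax, ay), (bx, by) = center(a), center(b)
    return abs(ax - bx) + abs(ay - by)

def match_captions(captions, regions):
    """Assign each caption box to a distinct candidate figure region,
    minimizing the *total* distance over all pairs. A globally optimal
    assignment prevents two captions from claiming the same figure,
    which a greedy nearest-region rule can get wrong when a caption
    lies between two figures. Requires len(regions) >= len(captions).
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(regions)), len(captions)):
        cost = sum(distance(captions[i], regions[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return {i: j for i, j in enumerate(best)}
```

The brute-force search is exponential in the number of captions, which is acceptable only because a single page rarely holds more than a handful of figures; a production system would use an assignment algorithm such as Hungarian matching instead.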
