Automatic Extraction of Figures from Scholarly Documents

Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple ``figures'' such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.

[1]  Jean-Luc Meunier,et al.  Optimized XY-cut for determining a page reading order , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[2]  Jian Fan,et al.  Layout and Content Extraction for PDF Documents , 2004, Document Analysis Systems.

[3]  C. Lee Giles,et al.  Figure Metadata Extraction from Digital Documents , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[4]  C. Lee Giles,et al.  An Architecture for Information Extraction from Figures in Digital Libraries , 2015, WWW.

[5]  Dan S. Bloomberg Multiresolution morphological analysis of document images , 1992, Other Conferences.

[6]  Robert P. Futrelle,et al.  Extraction,layout analysis and classification of diagrams in PDF documents , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[7]  Sargur N. Srihari Document Image Understanding , 1986, FJCC.

[8]  James Ze Wang,et al.  Automated analysis of images in documents for intelligent document search , 2009, International Journal on Document Analysis and Recognition (IJDAR).

[9]  C. Lee Giles,et al.  Segregating and extracting overlapping data points in two-dimensional plots , 2008, JCDL '08.

[10]  Dan S. Bloomberg,et al.  Multiresolution Morphological Approach to Document Image Analysis , 1991 .

[11]  Lior Rokach,et al.  A figure search engine architecture for a chemistry digital library , 2013, JCDL '13.

[12]  Tamir Hassan,et al.  ICDAR 2013 Table Competition , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[13]  Tamir Hassan,et al.  Object-level document analysis of PDF files , 2009, DocEng '09.

[14]  Syed Saqib Bukhari,et al.  Improved document image segmentation algorithm using multiresolution morphology , 2011, Electronic Imaging.

[15]  George Nagy,et al.  HIERARCHICAL REPRESENTATION OF OPTICALLY SCANNED DOCUMENTS , 1984 .

[16]  Christopher Andreas Clark,et al.  Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers , 2015, AAAI Workshop: Scholarly Big Data.

[17]  Robert P. Futrelle,et al.  Recognition and Classification of Figures in PDF Documents , 2005, GREC.