Figure Metadata Extraction from Digital Documents

Academic papers contain multiple figures (information graphics) that present important findings and experimental results. Automatic data extraction from such figures and classification of information graphics are not straightforward, and both are well-studied problems in document analysis \cite{4275059}. Moreover, very few digital library search engines index figures or their associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification, and data extraction from figures in PDF documents: accurate automatic extraction of figures and their associated metadata, which is a nontrivial task. Document layout, font information, and lexical and linguistic features are considered for figure caption extraction from PDF documents, using both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.
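To make the rule-based side of caption extraction concrete, the following is a minimal sketch of a purely lexical rule applied to text lines already extracted from a PDF. It is an illustration only, not the extractor described above: the regular expressions, the function names find_captions and find_mentions, and the assumption that a caption starts with "Figure" or "Fig." followed by a number and a ":" or "." delimiter are all hypothetical. Layout and font features, which the abstract also lists, are not modeled here.

```python
import re

# Hypothetical lexical rule: a caption line starts with "Figure"/"Fig.",
# then a number, then a ":" or "." delimiter, then the caption text.
CAPTION_RE = re.compile(r"^\s*fig(?:ure)?\.?\s*(\d+)\s*[:.]\s*(\S.*)$", re.IGNORECASE)

# Hypothetical rule for in-text mentions such as "as shown in Fig. 2".
MENTION_RE = re.compile(r"\bfig(?:ure)?\.?\s*(\d+)", re.IGNORECASE)


def find_captions(lines):
    """Return (figure_number, caption_text) for lines that look like captions."""
    captions = []
    for line in lines:
        match = CAPTION_RE.match(line)
        if match:
            captions.append((int(match.group(1)), match.group(2).strip()))
    return captions


def find_mentions(lines):
    """Return figure numbers referenced in lines that are not captions."""
    mentions = []
    for line in lines:
        if CAPTION_RE.match(line):
            continue  # skip caption lines; only body-text references count
        for match in MENTION_RE.finditer(line):
            mentions.append(int(match.group(1)))
    return mentions


sample_lines = [
    "Figure 1. System overview.",
    "As shown in Figure 1, the system has two stages.",
    "Fig. 2: Accuracy vs. training size",
    "Figure 3 shows the results.",
]
captions = find_captions(sample_lines)
mentions = find_mentions(sample_lines)
```

A rule like this is brittle on its own (for example, a body sentence that happens to begin with "Figure 3." would be misread as a caption), which is one reason to combine lexical cues with layout and font information or to use a learned classifier.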

[1] Hagit Shatkay, et al. An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents, 2011, 2011 IEEE International Conference on Bioinformatics and Biomedicine.

[2] Kun Bai, et al. Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines, 2009, 2009 10th International Conference on Document Analysis and Recognition.

[3] Larry S. Davis, et al. Classifying Computer Generated Charts, 2007, 2007 International Workshop on Content-Based Multimedia Indexing.

[4] Lior Rokach, et al. A figure search engine architecture for a chemistry digital library, 2013, JCDL '13.

[5] Chew Lim Tan, et al. Extraction of Vectorized Graphical Information from Scientific Chart Images, 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[6] Jeffrey Heer, et al. ReVision: automated classification, analysis and redesign of chart images, 2011, UIST.

[7] Mirella Lapata, et al. Automatic Paragraph Identification: A Study across Languages and Domains, 2004, EMNLP.

[8] James Ze Wang, et al. Automated analysis of images in documents for intelligent document search, 2009, International Journal on Document Analysis and Recognition (IJDAR).

[9] Prasenjit Mitra, et al. Summarizing figures, tables, and algorithms in scientific publications to augment search results, 2012, TOIS.