An Architecture for Information Extraction from Figures in Digital Libraries

Scholarly documents contain many figures that represent experimental findings. These figures are generated from data that is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. The architecture consists of four modules: (1) an extractor for figures and their associated metadata (captions and mentions) from PDF documents; (2) a search engine over the extracted figures and metadata; (3) an image processing module for automated data extraction from the figures; and (4) a natural language processing module for understanding the semantics of a figure. We discuss the challenges in each step and report two algorithms: an extractor for vector graphics in scholarly documents and a classifier for figures. Our extractor improves on the state of the art by more than 10%, and the classification process is highly scalable while still achieving 85% accuracy. We also describe a semi-automatic system for data extraction from figures, which is integrated with our search engine to improve the user experience.
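As an illustration of how the first two modules could fit together, the following is a minimal sketch of a figure record carrying extracted metadata and a toy keyword index over captions and mentions. It is not the implementation described in the paper: the names `Figure` and `FigureIndex`, the field layout, and the simple AND-matching search are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Figure:
    """Hypothetical record for one extracted figure and its metadata."""
    figure_id: str
    caption: str
    mentions: List[str] = field(default_factory=list)  # sentences citing the figure
    figure_type: str = "unknown"  # would be set by a figure classifier


class FigureIndex:
    """Toy inverted index over captions and mentions (illustrative only)."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = {}
        self._figures: Dict[str, Figure] = {}

    def add(self, fig: Figure) -> None:
        """Index every token of the caption and mentions under the figure id."""
        self._figures[fig.figure_id] = fig
        text = " ".join([fig.caption, *fig.mentions]).lower()
        for raw in text.split():
            token = raw.strip(".,;:()")
            if token:
                self._postings.setdefault(token, set()).add(fig.figure_id)

    def search(self, query: str) -> List[Figure]:
        """Return figures whose metadata contains every query token."""
        ids: Optional[Set[str]] = None
        for token in query.lower().split():
            hits = self._postings.get(token, set())
            ids = hits if ids is None else ids & hits
        return [self._figures[i] for i in sorted(ids or set())]


index = FigureIndex()
index.add(Figure("f1", "Figure 1: Precision versus recall for the extractor.",
                 ["Figure 1 shows precision."]))
index.add(Figure("f2", "Figure 2: Runtime versus corpus size."))
matches = [f.figure_id for f in index.search("versus")]
```

Keeping the metadata in a plain record separate from the index mirrors the modular split in the abstract: the data extraction and language processing modules could enrich the same record without changing how it is searched.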
