Automated Data Extraction from Scholarly Line Graphs

Line graphs are ubiquitous in scholarly papers. They are usually generated from a data table and are often used to compare the performance of various methods. The data underlying these figures cannot be accessed directly. Manual extraction of this data is laborious and does not scale. On the other hand, automated systems for this data extraction task are not yet available. We report an analysis of line graphs to explain the challenges of building a fully automated data extraction system. Next, we describe a system for automated data extraction from color line graphs. Our system has multiple components: image classification for identifying line graphs, text extraction from the figures, and curve extraction. For the classification, we show that unsupervised feature learning outperforms traditional low-level image descriptors by 10%. For the text extraction, our heuristics outperform the accuracy of the previous method by 29%. We also propose a novel curve extraction method that has an average accuracy of 82%. We also describe a large, partially annotated dataset for future research.
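The abstract does not spell out how curves are extracted, so the following is only an illustrative sketch of the general idea of separating curves in a color line graph by pixel color. It is not the method proposed in the paper. The function name `extract_curves_by_color`, the white background, and the exact-color grouping are all assumptions made for this example; real figures have anti-aliasing, axes, and overlapping curves that this sketch ignores.

```python
import numpy as np

def extract_curves_by_color(img, background=(255, 255, 255), min_pixels=5):
    """Group non-background pixels by exact RGB color and trace each group.

    Hypothetical helper, not the paper's algorithm.
    img: H x W x 3 uint8 array.
    Returns {color_tuple: [(x, y), ...]}, where y is the mean pixel row
    of that color in column x (image coordinates, origin at top-left).
    """
    h, w, _ = img.shape
    flat = img.reshape(-1, 3)
    # Keep only pixels that differ from the assumed background color.
    mask = np.any(flat != np.array(background, dtype=img.dtype), axis=1)
    colors, inverse = np.unique(flat[mask], axis=0, return_inverse=True)
    inverse = inverse.ravel()
    # Recover (row, column) positions of the kept pixels.
    rows, cols = np.divmod(np.flatnonzero(mask), w)
    curves = {}
    for idx, color in enumerate(colors):
        sel = inverse == idx
        if sel.sum() < min_pixels:
            continue  # too few pixels to be a curve; treat as noise
        xs, ys = cols[sel], rows[sel]
        trace = [(int(x), float(ys[xs == x].mean())) for x in np.unique(xs)]
        curves[tuple(int(c) for c in color)] = trace
    return curves

# Synthetic example: two flat "curves" on a white canvas.
img = np.full((20, 30, 3), 255, dtype=np.uint8)
img[5, :] = (255, 0, 0)    # red curve at pixel row 5
img[12, :] = (0, 0, 255)   # blue curve at pixel row 12
curves = extract_curves_by_color(img)
```

Converting the traced pixel rows into data values would additionally require the axis tick labels, which is where the text extraction component mentioned above would come in.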
