Automated analysis of line plots in documents

Information graphics, such as graphs and plots, are used in technical documents to convey information to human readers and to aid understanding. Graphics are usually a key component of a technical document, as they enable the author to convey complex ideas in a simplified visual format. However, automatic text recognition systems, which are typically used to digitize documents, lose the ideas conveyed in graphical form. We contend that the information extracted from these graphics can be used to better understand the ideas conveyed in the document. In scientific papers, line plots are the most commonly used graphic for representing experimental results, typically as a correlation between the values represented on the axes. The contribution of our work is a series of image processing algorithms that automatically extract relevant information, including text and plot data, from graphics found in technical documents. We validate the approach by running experiments on a dataset of line plots obtained from computer science conference papers and by evaluating the variation of each reconstructed curve from the original curve. Our algorithm achieves a classification accuracy of 91% across the dataset and successfully extracts the axes from 92% of line plots. Axis label extraction and line curve tracing succeed on about half of the line plots.
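The abstract does not say how the variation between a reconstructed curve and the original curve is measured. As a minimal sketch, assuming both curves are available as lists of (x, y) points in data coordinates, one plausible measure is the root-mean-square error over their shared x-range. The function names `interpolate` and `curve_rmse`, the sample count, and the choice of RMSE are illustrative assumptions, not the method used in the paper.

```python
from bisect import bisect_right
from math import sqrt


def interpolate(xs, ys, x):
    """Linearly interpolate a curve at x; xs must be sorted in increasing order."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # now xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def curve_rmse(original, reconstructed, samples=100):
    """RMSE between two curves, sampled evenly over their overlapping x-range.

    Hypothetical metric: the source only states that curve variation is evaluated.
    """
    if samples < 2:
        raise ValueError("need at least two sample positions")
    ox, oy = zip(*sorted(original))
    rx, ry = zip(*sorted(reconstructed))
    lo = max(ox[0], rx[0])
    hi = min(ox[-1], rx[-1])
    if hi <= lo:
        raise ValueError("curves do not overlap along the x-axis")
    step = (hi - lo) / (samples - 1)
    total = 0.0
    for k in range(samples):
        x = lo + k * step
        diff = interpolate(ox, oy, x) - interpolate(rx, ry, x)
        total += diff * diff
    return sqrt(total / samples)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
    traced = [(0.0, 0.5), (1.0, 1.5), (2.0, 4.5)]  # same shape, shifted up
    print(curve_rmse(truth, traced))
```

Resampling both curves at common x positions avoids requiring the traced curve to have the same number of points as the original. A traced curve that matches the original exactly yields an error of zero, and a uniform vertical shift yields an error equal to that shift.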
