A Comparison of Approaches for Automated Text Extraction from Scholarly Figures

So far, there has been no comparative evaluation of approaches for text extraction from scholarly figures. To fill this gap, we define a generic pipeline for text extraction that abstracts from the existing approaches documented in the literature. In this paper, we use this pipeline to systematically evaluate and compare 32 configurations for text extraction over four datasets of scholarly figures of different origin and characteristics. In total, the experiments cover more than 400 manually labeled figures. The results show that the BS-4OS configuration achieves the best F-measure of 0.67 for text location detection and the best average Levenshtein distance of 4.71 between the recognized text and the gold standard on all four datasets using the Ocropy OCR engine.
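The two reported measures can be illustrated with a minimal Python sketch. It assumes the standard definitions: Levenshtein distance as the minimum number of single-character insertions, deletions, and substitutions, and F-measure as the harmonic mean of precision and recall. The function names `levenshtein` and `f_measure` are hypothetical, and the abstract does not specify how detected text regions are matched against the gold standard, so the counts passed to `f_measure` are taken as given.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when nothing matches."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, a lower average Levenshtein distance means the recognized text is closer to the gold standard, while a higher F-measure means better detection of text locations.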
