Ontologies-based Optical Character Recognition-error Correction Method for Bar Graphs

Graphs are an effective way to present significant information concisely in academic literature, and readers can benefit from automatic extraction of graph information. The conventional technique relies on optical character recognition (OCR). However, OCR results can be imperfect because OCR performance depends on factors such as image quality. This is a critical problem because misrecognized text gives readers incorrect information and leads to miscommunication. Numerous recent publications document improvements in OCR performance and OCR result correction; however, only a few studies have focused on using semantics to solve this problem. In this study, we propose a novel method for OCR-error correction that combines several techniques, including ontologies, natural language processing, and edit distance. The input consists of bar graphs and their associated information, such as captions and the paragraphs that cite them. We implemented five conditions to cover all possible situations for acquiring the most similar words as substitutes for incorrect OCR results. Moreover, we used DBpedia and WordNet to find word categories and part-of-speech tags. We evaluated our method by comparing its accuracy and precision with those of our previous method, which used only the edit distance technique. Our method outperformed the previous method: its overall accuracy reached 81%, whereas that of the previous method was 54%. Based on this evidence, we conclude that our solution to the OCR problem is effective.

Keywords: OCR-error correction; dependency parsing; ontology; edit distance; two-dimensional bar graphs.
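
As an illustration of the edit distance technique that the abstract names as the baseline, the following is a minimal sketch, not the authors' implementation. It assumes the candidate words come from text associated with the graph (for example, its caption). The function names `edit_distance` and `closest_word` and the sample vocabulary are hypothetical, and the ontology-based and part-of-speech steps of the proposed method are not shown.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def closest_word(ocr_token: str, candidates: Sequence[str]) -> str:
    """Return the candidate nearest to the OCR token (first one wins ties).

    Hypothetical helper: comparison is case-insensitive, and an empty
    candidate list raises ValueError.
    """
    token = ocr_token.lower()
    return min(candidates, key=lambda w: edit_distance(token, w.lower()))


# Assumed vocabulary taken from a graph's caption (illustrative only).
caption_words = ["accuracy", "precision", "method"]
print(closest_word("accurocy", caption_words))
```

A purely string-based choice like this cannot tell apart candidates that are equally close in spelling, which is the kind of case where semantic information such as word categories could help.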
