Novel Ontologies-based Optical Character Recognition-error Correction Cooperating with Graph Component Extraction

literature. Extracting the information in a graph clearly benefits readers who want to interpret it, because the significant information presented in the graph becomes available as text. A typical tool for transforming image-based characters into computer-editable characters is optical character recognition (OCR). Unfortunately, OCR cannot guarantee perfect results, because it is sensitive to noise and input quality. This is a serious problem: misrecognized text gives readers incorrect information and leads to miscommunication. In this study, we present a novel method for correcting OCR errors in bar graphs using semantics, namely ontologies and dependency parsing. In addition, we use the graph component extraction proposed in our previous study to omit irrelevant parts from the graph components, which cleans and prepares the input data for OCR-error correction. The main objectives of this paper are to extract significant information from the graph using OCR and to correct OCR errors using semantics. As a result, our method delivered strong performance, with the highest accuracies and F-measures. Moreover, we found that our input data contained less noise, owing to the effectiveness of our graph component extraction. Based on this evidence, we conclude that our solution to the OCR problem achieves these objectives.
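To illustrate the general idea of correcting OCR output against a known vocabulary, the following is a minimal sketch and not the method of this paper: it replaces the ontology and dependency parsing with plain approximate string matching from Python's standard `difflib` module. The term list `ONTOLOGY_TERMS`, the helper names `correct_token` and `correct_label`, and the `cutoff` value are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical vocabulary standing in for terms drawn from a domain ontology.
ONTOLOGY_TERMS = ["temperature", "rainfall", "population",
                  "january", "february", "revenue"]


def correct_token(token, vocabulary=ONTOLOGY_TERMS, cutoff=0.7):
    """Return the closest vocabulary term, or the token unchanged
    when no term reaches the similarity cutoff (an assumed threshold)."""
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_label(label):
    """Correct each whitespace-separated token of an OCR-read graph label."""
    return " ".join(correct_token(tok) for tok in label.split())


if __name__ == "__main__":
    # "rn" misread for "m" and "v" misread for "y" are typical OCR confusions.
    print(correct_label("ternperature Januarv"))
```

Tokens that resemble no vocabulary term are passed through unchanged, so the sketch never forces a correction; matched tokens come back in the lowercase form stored in the vocabulary. A semantic approach such as the one described above would additionally use relations between terms, which string similarity alone cannot capture.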
