Design and Implementation of OCR Correction Model for Numeric Digits based on a Context Sensitive and Multiple Streams

In an automated business document processing system that maintains financial data, errors in query-based retrieval of numbers are critical to the overall performance and usability of the system. Automatic spelling correction methods have emerged and have played an important role in the development of information retrieval systems. However, the scope of these methods has been limited to symbols, such as alphabetic letter strings, that can be stored in the form of trainable templates or a custom dictionary. Numbers, on the other hand, are sequences of digits that cannot be stored in a dictionary; they are pure Markov sequences. In this paper we propose a new OCR model for spelling correction of numbers that uses multiple streams and context-based correction on top of a probabilistic information retrieval framework. We implemented the proposed error correction model as a sub-module and integrated it into an existing automated invoice document processing system. We also present comparative test results indicating that our model significantly improves the overall precision of the system.
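
As a rough illustration of the multiple-streams idea only, the sketch below combines several OCR readings of the same numeric field by choosing, at each position, the digit that is most likely under a simple error model. Everything in it is an assumption made for illustration: the function names, the uniform per-digit error model, and the 0.9 accuracy value are hypothetical and are not taken from the paper, and the context-based component is not modeled.

```python
DIGITS = "0123456789"


def p_obs_given_true(observed: str, true: str, accuracy: float = 0.9) -> float:
    """Hypothetical error model: a stream reads the true digit with
    probability `accuracy`; the remaining mass is spread uniformly
    over the nine other digits."""
    if observed == true:
        return accuracy
    return (1.0 - accuracy) / 9.0


def combine_streams(readings: list[str], accuracy: float = 0.9) -> str:
    """Combine equal-length digit strings from several OCR streams.

    For each position, pick the candidate digit that maximizes the
    product of per-stream likelihoods (streams are assumed independent).
    """
    if not readings:
        raise ValueError("at least one reading is required")
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("all readings must have the same length")

    corrected = []
    for pos in range(length):
        best_digit, best_score = None, -1.0
        for candidate in DIGITS:
            score = 1.0
            for reading in readings:
                score *= p_obs_given_true(reading[pos], candidate, accuracy)
            if score > best_score:
                best_digit, best_score = candidate, score
        corrected.append(best_digit)
    return "".join(corrected)


if __name__ == "__main__":
    # Three streams read the same invoice amount; one misreads the first digit.
    print(combine_streams(["1024", "1024", "7024"]))
```

With a uniform error model this reduces to per-position majority voting; a digit-specific confusion table (for example, making 1 and 7 more likely to be confused) could replace `p_obs_given_true` without changing the rest of the sketch.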
