OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set

Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were already abundant and widely published at that time, so there was a need to convert them into digital format. OCR, short for Optical Character Recognition, was conceived to translate paper-based books into digital e-books. However, OCR systems remain error-prone: they produce misspellings in the recognized text, especially when the source document is of low printing quality. This paper proposes a post-processing, context-sensitive OCR error correction method for detecting and correcting non-word and real-word OCR errors. The cornerstone of the proposed approach is the use of the Google Web 1T 5-gram data set as a dictionary of words to spell-check OCR text. The Google data set incorporates a very large vocabulary and word statistics drawn entirely from the Internet, making it a reliable source for dictionary-based error correction. The core of the proposed solution is a combination of three algorithms: error detection, candidate spelling generation, and error correction, all of which exploit information extracted from the Google Web 1T 5-gram data set. Experiments conducted on scanned images written in different languages showed a substantial improvement in the OCR error correction rate. As future work, the proposed algorithm is to be parallelized to support parallel and distributed computing architectures.
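
The three-stage pipeline named in the abstract (error detection, candidate spelling generation, error correction) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny `UNIGRAMS` and `NGRAMS` tables are hypothetical stand-ins for Google Web 1T counts, candidates are ranked by shared character bigrams (the abstract does not say how candidates are generated), and context is reduced to a trigram window instead of full 5-grams. It handles only non-word errors; real-word errors, which the paper also targets, are not covered here.

```python
from collections import Counter

# Hypothetical toy counts standing in for Google Web 1T data (not real values).
UNIGRAMS = Counter({"the": 1000, "cat": 50, "sat": 40, "on": 800,
                    "mat": 30, "car": 60, "eat": 45})
NGRAMS = Counter({("the", "cat", "sat"): 20, ("the", "car", "sat"): 1})


def char_bigrams(word):
    """Return the set of character bigrams of a word, with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def detect_errors(tokens):
    """Stage 1: flag positions whose token is absent from the unigram table."""
    return [i for i, tok in enumerate(tokens) if tok not in UNIGRAMS]


def candidates(word, limit=5):
    """Stage 2: rank vocabulary words by shared character bigrams (assumed heuristic)."""
    target = char_bigrams(word)
    scored = []
    for vocab_word, count in UNIGRAMS.items():
        overlap = len(target & char_bigrams(vocab_word))
        if overlap:
            scored.append((overlap, count, vocab_word))
    scored.sort(reverse=True)  # most overlap first, then most frequent
    return [vocab_word for _, _, vocab_word in scored[:limit]]


def correct(tokens):
    """Stage 3: pick the candidate whose surrounding word n-gram is most frequent."""
    out = list(tokens)
    for i in detect_errors(tokens):
        cands = candidates(out[i])
        if not cands:
            continue  # nothing plausible found; leave the token unchanged
        prev_tok = out[i - 1] if i > 0 else "<s>"
        next_tok = out[i + 1] if i + 1 < len(out) else "</s>"
        # Context count decides; unigram frequency breaks ties.
        out[i] = max(cands, key=lambda c: (NGRAMS[(prev_tok, c, next_tok)], UNIGRAMS[c]))
    return out


ocr_tokens = ["the", "cot", "sat", "on", "the", "mat"]
print(correct(ocr_tokens))
```

With these toy counts, the out-of-vocabulary token "cot" is flagged, "cat" and "car" are among the generated candidates, and the context count for ("the", "cat", "sat") selects "cat". A realistic setup would read counts from the actual data set files and use longer context windows.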
