Utilizing web data in identification and correction of OCR errors

In this paper, we report on experiments in detecting and correcting OCR errors using web data. Specifically, we use Google search to draw on large-scale web resources and identify candidate corrections. We then combine Longest Common Subsequence (LCS) matching with Bayesian estimates to select the proper candidate automatically. On a small set of historical newspaper data, our approach achieves a recall of 51% and a precision of 100%. We further provide a detailed classification and analysis of all errors and, in particular, point out where our approach fails to suggest proper candidates for the remaining errors.
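
The candidate-selection idea can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the scoring formula, the add-one smoothing, the `min_similarity` threshold, and the function names are assumptions made for illustration, and the candidate counts stand in for frequencies that would be gathered from web search results.

```python
from math import log


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pick_candidate(ocr_token, candidate_counts, min_similarity=0.6):
    """Return the best-scoring candidate, or None if none is similar enough.

    candidate_counts maps each candidate word to a (hypothetical) frequency,
    e.g. how often it appeared in search results for the surrounding context.
    """
    if not ocr_token or not candidate_counts:
        return None
    total = sum(candidate_counts.values())
    vocab = len(candidate_counts)
    best, best_score = None, float("-inf")
    for cand, count in candidate_counts.items():
        if not cand:
            continue
        # Normalized LCS similarity in [0, 1], standing in for the
        # likelihood of observing ocr_token given the candidate.
        similarity = 2 * lcs_length(ocr_token, cand) / (len(ocr_token) + len(cand))
        if similarity < min_similarity:
            continue  # abstain instead of guessing, which favors precision
        # Add-one smoothed prior estimated from the candidate frequencies.
        prior = (count + 1) / (total + vocab)
        score = log(prior) + log(similarity)
        if score > best_score:
            best, best_score = cand, score
    return best


if __name__ == "__main__":
    print(pick_candidate("tbe", {"the": 900, "tube": 50, "xyz": 50}))
```

Returning `None` when no candidate clears the threshold leaves the token uncorrected; this kind of abstention trades recall for precision, which is in the spirit of the results reported above.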
