Post-OCR Error Detection by Generating Plausible Candidates

The accuracy of Optical Character Recognition (OCR) technologies considerably affects how digital documents are indexed, accessed, and exploited. Post-processing approaches detect and correct the remaining errors to improve the quality of OCR texts. However, state-of-the-art approaches still leave room for improvement: most existing post-OCR techniques either rely on predefined lists of error positions or apply simple techniques to detect errors. In this paper, we describe a novel error detector that uses features ranging from the character level (including a character noisy channel and the index of peculiarity) to the word level (such as frequencies of n-grams and skip-grams, and part-of-speech). Experimental results show that our approach outperforms the best-performing techniques in the ICDAR 2017 Competition on Post-OCR Text Correction.
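
To make the idea of combining character-level and word-level evidence concrete, the following is a minimal sketch and not the detector described in this paper. It assumes a small list of clean tokens as reference text and computes two illustrative features per token: a log word frequency and the share of character trigrams never seen in the clean text. The function names (`char_ngrams`, `build_counts`, `token_features`) and both features are hypothetical stand-ins chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return character n-grams of a word, padded with '#' boundary marks."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_counts(clean_tokens):
    """Count words and character trigrams in clean reference text (assumed input)."""
    word_counts = Counter(t.lower() for t in clean_tokens)
    trigram_counts = Counter(g for t in clean_tokens for g in char_ngrams(t))
    return word_counts, trigram_counts


def token_features(token, word_counts, trigram_counts):
    """Compute two illustrative features for one OCR token."""
    grams = char_ngrams(token)
    # Character-level signal: trigrams that never occur in the clean text.
    unseen = sum(1 for g in grams if trigram_counts[g] == 0)
    return {
        # Word-level signal: smoothed log frequency of the token itself.
        "log_word_freq": math.log1p(word_counts[token.lower()]),
        "unseen_trigram_ratio": unseen / len(grams) if grams else 0.0,
    }


clean = "the quick brown fox jumps over the lazy dog".split()
word_counts, trigram_counts = build_counts(clean)
features_ok = token_features("the", word_counts, trigram_counts)
features_bad = token_features("tbe", word_counts, trigram_counts)
```

In this toy setting, the misrecognized token `tbe` receives a higher unseen-trigram ratio and a lower word frequency than `the`. A feature vector of this kind could then be passed to any binary classifier that labels tokens as erroneous or correct.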
