Paper Information - Post-OCR Error Detection by Generating Plausible Candidates

The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digital documents are indexed, accessed, and exploited. Post-processing approaches detect and correct remaining errors to improve the quality of OCR texts. However, state-of-the-art approaches still need to be improved. Most existing post-OCR techniques use predefined error position lists or apply simple techniques to detect errors. In this paper, we describe a novel error detector that uses features ranging from the character level (including the character noisy channel and the index of peculiarity) to the word level (such as frequencies of n-grams, skip-grams, and part-of-speech). Experimental results show that our approach outperforms the best-performing techniques in the ICDAR 2017 Competition on Post-OCR Text Correction.

Mickaël Coustaty | Antoine Doucet | Adam Jatowt | Nhu-Van Nguyen | Thi-Tuyet-Hai Nguyen
