Paper Information - OCR Error Correction Using Character Correction and Feature-Based Word Classification

This paper explores the use of a learned classifier for post-OCR text correction. Experiments with the Arabic language show that this approach, which integrates a weighted confusion matrix and a shallow language model, corrects the vast majority of segmentation and recognition errors, the most frequent error types in our dataset.

Nachum Dershowitz | Ido Kissos

[1] Klaus U. Schulz, et al. Fast string correction with Levenshtein automata, 2002, International Journal on Document Analysis and Recognition.

[2] Walid Magdy, et al. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology, 2006, EMNLP.

[3] James H. Martin, et al. Speech and language processing: an introduction to natural language processing, 2000.

[4] Richard M. Schwartz, et al. Robust language-independent OCR system, 1999, Other Conferences.

[5] Karen Kukich, et al. Techniques for automatically correcting words in text, 1992, CSUR.

[6] James H. Martin, et al. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2000.

[7] Haikal El Abed, et al. Guide to OCR for Arabic Scripts, 2012, Springer London.

[8] Kent Fitch, et al. Correcting noisy OCR: context beats confusion, 2014, DATeCH '14.