Post processing with first- and second-order hidden Markov models

In this paper, we present the implementation and evaluation of first-order and second-order hidden Markov models (HMMs) for identifying and correcting OCR errors in the post-processing of books. Our experiments show that the first-order model corrects approximately 10% of the errors with 100% precision, while the second-order model corrects a higher percentage of errors with much lower precision.
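As a rough illustration of how a first-order HMM can propose a correction for an OCR string, the sketch below treats the intended characters as hidden states and the OCR output characters as observations, then recovers the most likely hidden sequence with Viterbi decoding. It is not the implementation evaluated here: the function name `viterbi`, the probability floor `FLOOR`, and all table values are made-up assumptions for a toy four-letter alphabet, and the rows are not normalized.

```python
import math

FLOOR = 1e-12  # assumed smoothing value for events missing from the toy tables


def logp(p):
    """Log of a probability, floored so unseen events stay finite."""
    return math.log(max(p, FLOOR))


def viterbi(observed, states, start_p, trans_p, emit_p):
    """Return the most likely hidden character string for an observed string."""
    if not observed:
        return ""
    # best[s] = (log-probability, path) of the best path that ends in state s
    best = {
        s: (logp(start_p.get(s, 0)) + logp(emit_p[s].get(observed[0], 0)), s)
        for s in states
    }
    for ch in observed[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state for s (first-order dependency).
            score, path = max(
                (best[p][0] + logp(trans_p[p].get(s, 0)), best[p][1])
                for p in states
            )
            new_best[s] = (score + logp(emit_p[s].get(ch, 0)), path + s)
        best = new_best
    return max(best.values())[1]


# Hypothetical toy model: OCR sometimes confuses 'a' and 'o'.
STATES = ["a", "c", "o", "t"]
START_P = {"c": 1.0}
TRANS_P = {"c": {"a": 0.8, "o": 0.1}, "a": {"t": 0.8}, "o": {"t": 0.2}, "t": {}}
EMIT_P = {
    "c": {"c": 0.9},
    "a": {"a": 0.9, "o": 0.1},
    "o": {"o": 0.9, "a": 0.1},
    "t": {"t": 0.9},
}

print(viterbi("cot", STATES, START_P, TRANS_P, EMIT_P))  # prints cat
```

In this toy setting the transition probabilities favor "ca" followed by "at", so the observed "cot" is decoded as "cat" even though 'o' is the more likely reading of the middle character in isolation. A second-order model would condition each transition on the two preceding states instead of one.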
