OCR error correction using a noisy channel model

In this paper, we take a pattern recognition approach to correcting errors in text generated from printed documents using optical character recognition (OCR). We apply a very general, theoretically optimal model to the problem of OCR word correction, introduce practical methods for parameter estimation, and evaluate performance on real data.
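As a rough illustration of the noisy channel idea named in the title, the sketch below picks the candidate word w that maximizes P(w) · P(o | w) for an observed OCR string o. It is only a toy under stated assumptions, not the paper's model or its estimation method: the vocabulary, the prior probabilities, the confusion table, and the names `WORD_PRIOR`, `CONFUSION`, `channel_log_prob`, and `correct` are all hypothetical, and the channel handles only character substitutions between strings of equal length.

```python
import math

# Hypothetical source (language) model: prior probability P(w) of each word.
WORD_PRIOR = {"model": 0.5, "modem": 0.3, "motel": 0.2}

# Hypothetical channel parameters, keyed as (intended_char, observed_char).
# These numbers are invented for illustration, not estimated from data.
CONFUSION = {("e", "c"): 0.05, ("l", "1"): 0.05}
P_CORRECT = 0.9    # assumed probability that a character is read correctly
P_OTHER = 0.001    # assumed probability of any unlisted substitution


def channel_log_prob(observed: str, intended: str) -> float:
    """Return log P(observed | intended) under a substitution-only channel."""
    if len(observed) != len(intended):
        # Insertions and deletions are not modeled in this sketch.
        return float("-inf")
    total = 0.0
    for obs_ch, true_ch in zip(observed, intended):
        if obs_ch == true_ch:
            p = P_CORRECT
        else:
            p = CONFUSION.get((true_ch, obs_ch), P_OTHER)
        total += math.log(p)
    return total


def correct(observed: str) -> str:
    """Return the vocabulary word maximizing log P(w) + log P(observed | w)."""
    return max(
        WORD_PRIOR,
        key=lambda w: math.log(WORD_PRIOR[w]) + channel_log_prob(observed, w),
    )


if __name__ == "__main__":
    for token in ["modcl", "mode1", "model"]:
        print(token, "->", correct(token))
```

A realistic system would need a much larger vocabulary, a channel that also covers insertions, deletions, and multi-character confusions, and parameters estimated from data rather than set by hand.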
