A Generative Probabilistic OCR Model for NLP Applications

In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model's ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.
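
The noisy channel framing is commonly written as a Bayesian decoding problem. As a minimal sketch, assuming $W$ denotes the true text and $O$ the observed OCR output (both symbols are introduced here for illustration and are not taken from the paper), error correction seeks the most probable source text given the observation:

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O)
        \;=\; \arg\max_{W} \frac{P(W)\, P(O \mid W)}{P(O)}
        \;=\; \arg\max_{W} P(W)\, P(O \mid W)
```

Here $P(W)$ plays the role of a source (language) model and $P(O \mid W)$ the channel (corruption) model; $P(O)$ does not depend on $W$ and drops out of the maximization. How the paper factors these two terms into finite-state components is not stated in the abstract, so this shows only the general form.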
