Paper information - Using topic models for OCR correction

Using topic models for OCR correction

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy remains below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing (OCR correction) technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the result to generate smaller, topic-specific lexicons that improve recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy topic categorization model to refine the recognition output. We present the relative merits of each method and report results on the publicly available IAM database.
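The first method in the abstract (categorize the document by topic, then correct against a smaller topic-specific lexicon) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy training texts, the function names, the add-one-smoothed unigram scorer used as the topic categorizer, and the string-similarity matching from Python's `difflib` used in place of a real word recognizer are all assumptions made for illustration.

```python
import math
from collections import Counter
from difflib import get_close_matches

# Hypothetical toy training data: topic -> clean text for that topic.
TOPIC_TEXTS = {
    "sports": "the team won the match after the striker scored a late goal",
    "finance": "the bank raised interest rates and the market price of shares fell",
}


def build_topic_models(topic_texts):
    """Return per-topic word counts; each Counter's keys form that topic's lexicon."""
    return {topic: Counter(text.split()) for topic, text in topic_texts.items()}


def categorize(words, models):
    """Pick the topic with the highest add-one-smoothed unigram log-likelihood."""
    vocab = set().union(*models.values())

    def score(counts):
        total = sum(counts.values())
        # The extra +1 in the denominator reserves mass for unseen (noisy) words.
        return sum(math.log((counts[w] + 1) / (total + len(vocab) + 1)) for w in words)

    return max(models, key=lambda topic: score(models[topic]))


def correct(words, lexicon, cutoff=0.6):
    """Replace each out-of-lexicon word with its closest lexicon entry, if any."""
    corrected = []
    for word in words:
        if word in lexicon:
            corrected.append(word)
            continue
        match = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return corrected


def correct_document(recognized_text, topic_texts=TOPIC_TEXTS):
    """Categorize noisy recognizer output, then correct it with the reduced lexicon."""
    models = build_topic_models(topic_texts)
    words = recognized_text.lower().split()
    topic = categorize(words, models)
    return topic, correct(words, sorted(models[topic]))


if __name__ == "__main__":
    print(correct_document("the tean won the natch"))
```

The design point the sketch tries to show is that restricting candidates to one topic's lexicon shrinks the search space for each noisy word, so fewer unrelated words compete as corrections. A real system would score candidates with the recognizer's own confidences rather than plain string similarity.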

Venu Govindaraju | Faisal Farooq | Anurag Bhardwaj

[1] Venu Govindaraju, et al. Phrase-based correction model for improving handwriting recognition accuracies, 2009, Pattern Recognit.

[2] Kazem Taghva, et al. OCRSpell: an interactive spelling correction system for OCR errors in text, 2001, International Journal on Document Analysis and Recognition.

[3] Robert J. Price. Accurate Document Categorization of OCR-Generated Text.

[4] Horst Bunke, et al. Automatic bankcheck processing, 1997.

[5] Samy Bengio, et al. Offline recognition of unconstrained handwritten texts using HMMs and statistical language models, 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] K. Yamada, et al. WORD LEXICON REDUCTION BY CHARACTER SPOTTING, 2004.

[7] Robert Sabourin, et al. Large vocabulary off-line handwriting recognition: A survey, 2003, Pattern Analysis & Applications.

[8] Bidyut Baran Chaudhuri, et al. OCR Error Correction of an Inflectional Indian Language Using Morphological Parsing, 2000, J. Inf. Sci. Eng.

[9] Venu Govindaraju, et al. Holistic lexicon reduction for handwritten word recognition, 1996, Electronic Imaging.

[10] Gyeonghwan Kim, et al. A Lexicon Driven Approach to Handwritten Word Recognition for Real-Time Applications, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[11] Nasser Sherkat, et al. Word shape analysis for a hybrid recognition system, 1997, Pattern Recognit.

[12] Venu Govindaraju, et al. Syntactic methodology of pruning large lexicons in cursive script recognition, 2001, Pattern Recognit.

[13] Horst Bunke, et al. Lexicon reduction in an HMM-framework based on quantized feature vectors, 1997, Proceedings of the Fourth International Conference on Document Analysis and Recognition.

[14] Hinrich Schütze, et al. Introduction to information retrieval, 2008.

[15] Karen Kukich, et al. Techniques for automatically correcting words in text, 1992, CSUR.

[16] Adwait Ratnaparkhi, et al. A Simple Introduction to Maximum Entropy Models for Natural Language Processing, 1997.

[17] Rafael Llobet, et al. Stochastic error-correcting parsing for OCR post-processing, 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[18] Gyeonghwan Kim, et al. An architecture for handwritten text recognition systems, 1999, International Journal on Document Analysis and Recognition.

[19] Anthony J. Robinson, et al. An Off-Line Cursive Handwriting Recognition System, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[20] Horst Bunke, et al. The IAM-database: an English sentence database for offline handwriting recognition, 2002, International Journal on Document Analysis and Recognition.

[21] Michael L. Wick, et al. Context-Sensitive Error Correction: Using Topic Models to Improve OCR, 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[22] Sargur N. Srihari, et al. Integration of hand-written address interpretation technology into the United States Postal Service Remote Computer Reader system, 1997, Proceedings of the Fourth International Conference on Document Analysis and Recognition.

[23] Venu Govindaraju, et al. Reading handwritten US census forms, 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[24] Andrew McCallum, et al. Using Maximum Entropy for Text Classification, 1999.

[25] Andrew McCallum, et al. A comparison of event models for naive bayes text classification, 1998, AAAI 1998.

[26] Yiming Yang, et al. An example-based mapping method for text categorization and retrieval, 1994, TOIS.

[27] Venu Govindaraju, et al. Phrase Based Direct Model for Improving Handwriting Recognition Accuracies, 2008.

[28] Horst Bunke, et al. SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE, 2007.

[29] Venu Govindaraju, et al. A lexicon reduction strategy in the context of handwritten medical forms, 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).