Using topic models for OCR correction

Despite several decades of research in document analysis, recognizing unconstrained handwritten documents remains a challenging task. Previous research has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing, or OCR correction, technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the result to generate smaller topic-specific lexicons, which improve recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy topic categorization model to refine the recognition output. We present the relative merits of each method and report results on the publicly available IAM database.
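
The idea behind the first method, lexicon reduction by topic categorization, can be illustrated with a minimal sketch. It is not the paper's implementation: the toy lexicons, the function names (`categorize`, `correct`, `post_process`), the word-overlap vote that stands in for a trained topic classifier, and the use of `difflib` string similarity as the matching criterion are all assumptions made for illustration.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical topic-specific lexicons (toy data, not taken from the paper).
TOPIC_LEXICONS = {
    "sports": ["team", "match", "goal", "player", "season", "coach"],
    "finance": ["bank", "market", "share", "price", "profit", "loan"],
}


def categorize(words):
    """Pick the topic whose lexicon contains the most recognized words.

    This overlap vote is a simple stand-in for a trained topic classifier.
    Ties (including the all-zero case) fall back to the first topic listed.
    """
    scores = Counter()
    for topic, lexicon in TOPIC_LEXICONS.items():
        vocab = set(lexicon)
        scores[topic] = sum(1 for w in words if w in vocab)
    return max(TOPIC_LEXICONS, key=lambda t: scores[t])


def correct(words, lexicon, cutoff=0.6):
    """Replace each out-of-lexicon word with its closest lexicon entry.

    Words with no sufficiently similar entry are left unchanged.
    """
    corrected = []
    for w in words:
        if w in lexicon:
            corrected.append(w)
            continue
        candidates = get_close_matches(w, lexicon, n=1, cutoff=cutoff)
        corrected.append(candidates[0] if candidates else w)
    return corrected


def post_process(words):
    """Categorize the recognizer output, then correct it against the
    smaller lexicon of the chosen topic only."""
    topic = categorize(words)
    return topic, correct(words, TOPIC_LEXICONS[topic])


if __name__ == "__main__":
    # Simulated recognizer output with two misrecognized words.
    print(post_process(["team", "matcb", "goal", "plaver"]))
```

Restricting the candidate search to one topic's lexicon is what makes the correction step cheaper and less ambiguous than matching against a full vocabulary; the quality of the result depends on the topic decision being right.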
