Lexical postcorrection of OCR-results: the web as a dynamic secondary dictionary?

Postcorrection of OCR-results for text documents is usually based on electronic dictionaries. When scanning texts from a specific thematic area, conventional dictionaries often miss a considerable number of tokens. Furthermore, if word frequencies are stored with the entries, these frequencies will not properly reflect the frequencies found in the given thematic area. Correction adequacy suffers from these two shortcomings. We report on a series of experiments where we compare (1) the use of fixed, static large-scale dictionaries (including proper names and abbreviations) with (2) the use of dynamic dictionaries retrieved via an automated analysis of the vocabulary of web pages from a given domain, and (3) the use of mixed dictionaries. Our experiments, which address English and German document collections from a variety of fields, show that dynamic dictionaries of the above mentioned form can improve the coverage for the given thematic area in a significant way and help to improve the quality of lexical postcorrection methods.
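
The comparison of static, dynamic and mixed dictionaries can be illustrated with a minimal sketch. It is not the method used in the experiments: the tokenizer, the function names (`build_dictionary`, `coverage`, `correct`), the use of plain Levenshtein distance with a frequency tie-break, and the threshold `max_distance` are all assumptions made for illustration, and the "web pages" are simply strings supplied by the caller.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplification)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def build_dictionary(texts):
    """Build a frequency dictionary from a collection of texts,
    e.g. a static corpus or pages collected for one thematic area."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def coverage(tokens, dictionary):
    """Fraction of tokens that have an entry in the dictionary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in dictionary) / len(tokens)


def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def correct(token, dictionary, max_distance=2):
    """Return the closest dictionary entry within max_distance.
    Ties are broken by higher frequency, then alphabetically.
    Tokens without a close entry are returned unchanged."""
    if token in dictionary:
        return token
    best = None
    for entry, freq in dictionary.items():
        d = levenshtein(token, entry)
        if d > max_distance:
            continue
        key = (d, -freq, entry)
        if best is None or key < best:
            best = key
    return best[2] if best else token


# Toy data (invented for illustration only).
static_dict = build_dictionary(["the patient was given a dose of medicine"])
dynamic_dict = build_dictionary(["the stent was placed in the artery",
                                 "stent thrombosis after angioplasty"])
mixed_dict = static_dict + dynamic_dict  # Counter addition sums the frequencies

ocr_tokens = tokenize("the stemt was placed in the artery")
print(coverage(ocr_tokens, static_dict), coverage(ocr_tokens, mixed_dict))
print([correct(t, mixed_dict) for t in ocr_tokens])
```

In this toy setting the static dictionary has no entry close to the misrecognized token, so it cannot propose a correction, while the domain-specific and mixed dictionaries can. Because frequencies are summed in the mixed dictionary, the tie-break also reflects usage in the thematic area.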
