Adaptive text correction with Web-crawled domain-dependent dictionaries

Successful lexical text correction depends on high coverage of the underlying background dictionary. Still, most correction tools are built on static dictionaries that represent fixed collections of expressions of a given language. When texts from specific domains are processed, a significant part of the vocabulary is often missing from such dictionaries, and both automated and interactive correction systems produce suboptimal results. In this article, we describe strategies for crawling Web pages that fit the thematic domain of the given input text. Special filtering techniques are introduced to avoid pages with many orthographic errors. By collecting the vocabulary of filtered pages that match the vocabulary of the input text, we obtain dynamic dictionaries of modest size that reach excellent coverage values. A tool has been developed that automatically builds dictionaries by crawling in this way. Our correction experiments with crawled dictionaries, which address English and German document collections from a variety of thematic fields, show that these dictionaries reduce the error rate even of highly accurate texts, using completely automated correction methods. For interactive text correction, more sensible candidate sets for correcting erroneous words are obtained and the manual effort is significantly reduced. To complete the picture, we study the effect of using word trigram models for correction. Again, trigram models from crawled corpora outperform those obtained from static corpora.
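
The dictionary-building idea can be illustrated with a minimal sketch. It is not the tool described in the article: the pages are assumed to be already fetched as plain strings, the error filter is approximated by a hypothetical list of known misspellings, and the function names and the thresholds `max_error_rate` and `min_overlap` are illustrative assumptions.

```python
import re


def tokenize(text):
    """Return lowercase word tokens (letters only, including German umlauts and ß)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def error_rate(tokens, known_misspellings):
    """Estimate page quality as the fraction of tokens found in a misspelling list."""
    if not tokens:
        return 1.0  # treat empty pages as unusable
    return sum(t in known_misspellings for t in tokens) / len(tokens)


def build_dynamic_dictionary(input_text, pages, known_misspellings,
                             max_error_rate=0.01, min_overlap=0.2):
    """Collect the vocabulary of clean pages that share vocabulary with the input.

    Both thresholds are assumed values chosen for illustration only.
    """
    input_vocab = set(tokenize(input_text))
    dictionary = set()
    for page in pages:
        tokens = tokenize(page)
        # Filter 1: skip pages with too many (estimated) orthographic errors.
        if error_rate(tokens, known_misspellings) > max_error_rate:
            continue
        # Filter 2: keep only pages whose vocabulary overlaps with the input text.
        page_vocab = set(tokens)
        overlap = len(page_vocab & input_vocab) / max(len(input_vocab), 1)
        if overlap >= min_overlap:
            dictionary |= page_vocab
    return dictionary


def coverage(text, dictionary):
    """Fraction of the tokens of `text` that are contained in `dictionary`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in dictionary for t in tokens) / len(tokens)
```

In this sketch, coverage is measured per token of the input text. A real system would obtain the pages from a crawler or search engine and would need a more robust estimate of page quality than a fixed misspelling list.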
