Text Correction Using Domain Dependent Bigram Models from Web Crawls

The quality of text correction systems can be improved by using more complex language models and by taking the peculiarities of the garbled input text into account. We report on a series of experiments in which we crawl domain-dependent web corpora for a given garbled input text. From the crawled corpora we derive dictionaries and language models, which are then used to correct the input text. We show that correction accuracy improves when word bigram frequencies from the crawls are integrated as an additional score into a baseline correction strategy based on word similarity and word (unigram) frequencies. In a second series of experiments we compare the quality of distinct language models by measuring how closely each model reflects the frequencies observed in a given input text. The crawled language models turn out to be superior to language models obtained from standard corpora.
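
To make the scoring idea concrete, the following is a minimal sketch of how a combined correction score of this kind could be computed. The class and function names (CrawledLanguageModel, rank_candidates), the linear weighting, and the length-normalized Levenshtein similarity are illustrative assumptions; the abstract only states that word similarity, word (unigram) frequencies, and crawled bigram frequencies are combined.

```python
# Hedged sketch: one way to combine word similarity, unigram frequency, and
# crawled bigram frequency into a single candidate score. Weights, helper
# names, and the normalization are assumptions, not the authors' method.

from collections import Counter
from typing import Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance between two words (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


class CrawledLanguageModel:
    """Unigram and bigram frequencies derived from a crawled domain corpus."""

    def __init__(self, tokens: Iterable[str]):
        tokens = list(tokens)
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.total = max(len(tokens), 1)

    def unigram_score(self, word: str) -> float:
        return self.unigrams[word] / self.total

    def bigram_score(self, prev_word: str, word: str) -> float:
        return self.bigrams[(prev_word, word)] / max(self.unigrams[prev_word], 1)


def rank_candidates(garbled: str,
                    prev_word: str,
                    candidates: List[str],
                    lm: CrawledLanguageModel,
                    w_sim: float = 0.5,
                    w_uni: float = 0.25,
                    w_bi: float = 0.25) -> List[Tuple[str, float]]:
    """Rank correction candidates by a weighted sum of word similarity,
    unigram frequency, and left-context bigram frequency."""
    scored = []
    for cand in candidates:
        sim = 1.0 - levenshtein(garbled, cand) / max(len(garbled), len(cand), 1)
        score = (w_sim * sim
                 + w_uni * lm.unigram_score(cand)
                 + w_bi * lm.bigram_score(prev_word, cand))
        scored.append((cand, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

In such a setup, the candidate list would typically come from a dictionary lookup within a small edit-distance bound; the bigram term lets a candidate that is frequent after the given left context in the crawled domain corpus outrank a candidate that plain similarity alone would prefer.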

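For the second series of experiments, one plausible way to measure how closely a language model reflects the frequencies observed in an input text is the cross-entropy the model assigns to the text's bigrams. The choice of cross-entropy and the add-alpha smoothing below are assumptions for illustration; the abstract does not specify the closeness measure.

```python
# Hedged sketch: score a candidate language model by the average negative
# log2 probability of the bigrams observed in the input text. Lower values
# mean the model matches the text more closely. Smoothing scheme and measure
# are illustrative assumptions.

import math
from collections import Counter
from typing import Iterable


def bigram_cross_entropy(text_tokens: Iterable[str],
                         model_bigrams: Counter,
                         model_unigrams: Counter,
                         alpha: float = 1.0) -> float:
    """Cross-entropy (bits per bigram) of the text under the model,
    using add-alpha smoothing of the conditional bigram probabilities."""
    tokens = list(text_tokens)
    vocab = max(len(model_unigrams), 1)
    count, log_sum = 0, 0.0
    for prev, word in zip(tokens, tokens[1:]):
        num = model_bigrams[(prev, word)] + alpha
        den = model_unigrams[prev] + alpha * vocab
        log_sum -= math.log2(num / den)
        count += 1
    return log_sum / max(count, 1)
```

Under such a measure, a model crawled for the input text's domain would be expected to yield a lower cross-entropy than a model built from a general-purpose corpus, in line with the comparison reported in the abstract.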