Paper information - Fast Selection of Small and Precise Candidate Sets from Dictionaries for Text Correction Tasks

Fast Selection of Small and Precise Candidate Sets from Dictionaries for Text Correction Tasks

Lexical text correction relies on a central step where approximate search in a dictionary is used to select the best correction suggestions for an ill-formed input token. In previous work we introduced the concept of a universal Levenshtein automaton and showed how to use these automata to efficiently select from a dictionary all entries within a fixed Levenshtein distance of the garbled input word. In this paper we look at refinements of the basic Levenshtein distance that yield more sensible notions of similarity in distinct text correction applications, e.g., OCR. We show that the concept of a universal Levenshtein automaton can be adapted to these refinements. In this way we obtain a method for selecting correction candidates that is very efficient and that yields small candidate sets with high recall.
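The selection step described in the abstract can be illustrated with a minimal brute-force sketch. This is not the universal Levenshtein automaton from the paper: it compares the input against every dictionary entry with the standard dynamic-programming recurrence, which is far slower but returns the same kind of result for the plain Levenshtein case. The function names `edit_distance` and `select_candidates`, the `allowed_subs` parameter, and the sample confusion pairs are assumptions made for illustration. The refinement shown (only listed character confusions count as a single substitution) is one plausible example of a restricted distance, not necessarily one of those defined in the paper.

```python
from typing import Iterable, List, Optional, Set, Tuple


def edit_distance(a: str, b: str,
                  allowed_subs: Optional[Set[Tuple[str, str]]] = None) -> int:
    """Levenshtein distance between a (input) and b (dictionary entry).

    If allowed_subs is given, only the listed (input_char, entry_char)
    pairs count as a single substitution; any other mismatch costs 2,
    i.e. the same as one deletion plus one insertion.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            if ca == cb:
                diag = prev[j - 1]
            elif allowed_subs is None or (ca, cb) in allowed_subs:
                diag = prev[j - 1] + 1
            else:
                diag = prev[j - 1] + 2
            cur.append(min(prev[j] + 1,     # delete ca
                           cur[j - 1] + 1,  # insert cb
                           diag))           # match or substitute
        prev = cur
    return prev[-1]


def select_candidates(word: str, dictionary: Iterable[str], k: int,
                      allowed_subs: Optional[Set[Tuple[str, str]]] = None
                      ) -> List[str]:
    """Return all dictionary entries within distance k of word, sorted."""
    return sorted(w for w in dictionary
                  if edit_distance(word, w, allowed_subs) <= k)


# Hypothetical dictionary and OCR confusion pairs (garbled char, intended char).
DICTIONARY = ["hand", "bond", "fond", "hold", "handle"]
OCR_CONFUSIONS = {("o", "a"), ("c", "e"), ("l", "i")}

plain = select_candidates("hond", DICTIONARY, k=1)
refined = select_candidates("hond", DICTIONARY, k=1,
                            allowed_subs=OCR_CONFUSIONS)
print(plain)    # ['bond', 'fond', 'hand', 'hold']
print(refined)  # ['hand']
```

With the plain distance, four entries lie within distance 1 of the garbled token `hond`. Restricting substitutions to the assumed OCR confusions leaves only `hand`, which illustrates how a refined distance can shrink the candidate set while still keeping the intended correction.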

Klaus U. Schulz | Stoyan Mihov | Petar Mitankin
