Tuning the Selection of Correction Candidates for Garbled Tokens using Error Dictionaries

In previous work, we introduced a method for efficiently selecting, from a background dictionary, suitable correction candidates for a malformed token of a given input text. To keep the candidate sets small and meaningful, we used refinements of the Levenshtein distance that admit only restricted sets of substitutions, merges and splits. In those experiments, the set of admissible substitutions, merges and splits was determined via training, using ground truth data representing corrected parts of the input text. Here we show that an appropriate set of substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. We obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high
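To make the notion of a restricted Levenshtein distance concrete, the following Python sketch computes an edit distance in which only listed substitutions, merges and splits are admitted. It is an illustration under stated assumptions, not the implementation used in this work: the names `restricted_distance` and `select_candidates` are hypothetical, every operation is assumed to cost 1, single-character insertions and deletions are assumed to be always allowed, and the linear scan over the dictionary stands in for the efficient selection procedure, which is not reproduced here.

```python
INF = float("inf")


def restricted_distance(candidate, garbled, subs=frozenset(),
                        merges=frozenset(), splits=frozenset()):
    """Cost of turning the dictionary word `candidate` into `garbled`.

    subs:   pairs (correct_char, observed_char), e.g. ("l", "1")
    merges: pairs (two_correct_chars, observed_char), e.g. ("rn", "m")
    splits: pairs (correct_char, two_observed_chars), e.g. ("m", "rn")
    Assumption: insertions and deletions are always allowed at cost 1.
    """
    n, m = len(candidate), len(garbled)
    # d[i][j] = cost of turning candidate[:i] into garbled[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if i > 0:  # delete a character of the candidate
                best = min(best, d[i - 1][j] + 1)
            if j > 0:  # insert a character of the garbled token
                best = min(best, d[i][j - 1] + 1)
            if i > 0 and j > 0:
                c, g = candidate[i - 1], garbled[j - 1]
                if c == g:  # exact match, no cost
                    best = min(best, d[i - 1][j - 1])
                elif (c, g) in subs:  # admitted substitution only
                    best = min(best, d[i - 1][j - 1] + 1)
            if i > 1 and j > 0 and (candidate[i - 2:i], garbled[j - 1]) in merges:
                best = min(best, d[i - 2][j - 1] + 1)  # two chars merged into one
            if i > 0 and j > 1 and (candidate[i - 1], garbled[j - 2:j]) in splits:
                best = min(best, d[i - 1][j - 2] + 1)  # one char split into two
            d[i][j] = best
    return d[n][m]


def select_candidates(garbled, dictionary, bound, **ops):
    """Return all dictionary words within `bound` of `garbled` (naive scan)."""
    return [w for w in dictionary
            if restricted_distance(w, garbled, **ops) <= bound]
```

With the merge ("rn", "m") admitted, the word "modern" lies at distance 1 from the garbled token "modem"; without it, the same pair is only reachable through two deletions and one insertion. Restricting the admitted operations to those that actually occur in the input is what keeps the candidate sets small.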
