Deriving Symbol Dependent Edit Weights for Text Correction: The Use of Error Dictionaries

Most systems for correcting errors in texts rely on word distance measures such as the Levenshtein distance. Many experiments have shown that correction accuracy improves when the edit weights depend on the particular symbols involved in each edit operation. However, most approaches proposed so far require large amounts of training data in which errors and their corrections have been collected. In practice, preparing suitable ground truth data is often too costly, so uniform edit costs are used instead. In this paper we evaluate approaches for deriving symbol dependent edit weights that do not need any ground truth training data, and we compare them with methods based on ground truth training. We suggest a new approach in which special error dictionaries are used to estimate the weights. The method is simple and very efficient, requiring only one pass over the document to be corrected. Our experiments with different OCR systems and textual data show that the method consistently yields significant improvements in correction accuracy, often producing results comparable to those achieved with ground truth training.
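To make the one-pass idea concrete, the following Python sketch shows one plausible way to estimate symbol dependent edit weights from an error dictionary. It is an illustration under assumptions, not the paper's implementation: the dictionary format (a map from misrecognized tokens to their corrections), the alignment via a standard Levenshtein backtrace, the negative-log-frequency weight formula with additive smoothing, and the function names align_ops and estimate_weights are all assumptions introduced here.

```python
import math
from collections import Counter


def align_ops(err, truth):
    """Backtrace a standard Levenshtein alignment of an erroneous token
    against its correction, returning (kind, err_symbol, true_symbol)
    triples for matches, substitutions, deletions, and insertions."""
    n, m = len(err), len(truth)
    # dp[i][j] = edit distance between err[:i] and truth[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if err[i - 1] == truth[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # delete err symbol
                           dp[i][j - 1] + 1)         # insert true symbol
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if err[i - 1] == truth[j - 1] else 1)):
            kind = 'match' if err[i - 1] == truth[j - 1] else 'sub'
            ops.append((kind, err[i - 1], truth[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(('del', err[i - 1], None))
            i -= 1
        else:
            ops.append(('ins', None, truth[j - 1]))
            j -= 1
    return ops


def estimate_weights(tokens, error_dict, smoothing=1.0):
    """Single pass over the document's tokens: whenever a token is listed
    in the (hypothetical) error dictionary, align it with its correction
    and count the edit operations; turn counts into edit weights via
    smoothed negative log relative frequencies, so frequently observed
    confusions become cheap edits."""
    counts, total = Counter(), 0
    for tok in tokens:
        correction = error_dict.get(tok)
        if correction is not None and correction != tok:
            for op in align_ops(tok, correction):
                counts[op] += 1
                total += 1
    k = max(len(counts), 1)  # smoothing normalizer over observed operations
    return {op: -math.log((c + smoothing) / (total + smoothing * k))
            for op, c in counts.items()}
```

For example, with an error dictionary {'Tbe': 'The', 'tbe': 'the'}, one pass over an OCR output containing these tokens would make the substitution b -> h cheap relative to operations that were never observed, while the smoothing term keeps unseen operations at a finite but high cost. Whether weights are derived exactly this way in the paper is not claimed here; the sketch only shows why a single document pass suffices once the error dictionary is available.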
