A visual and interactive tool for optimizing lexical postcorrection of OCR results

Systems for postcorrection of OCR results can be fine-tuned and adapted to new recognition tasks in many respects. One issue is the selection and adaptation of a suitable background dictionary. Another is the choice of a correction model, which includes, among other decisions, the selection of an appropriate string distance measure and of a scoring function for ranking distinct correction alternatives. When the results obtained from distinct OCR engines are combined, further parameters have to be fixed. Because of all these degrees of freedom, adapting and fine-tuning systems for lexical postcorrection is a difficult process. Here we describe a visual and interactive tool that semi-automates the generation of ground truth data, partially automates the adjustment of parameters, actively supports error analysis, and thus helps to find correction strategies that lead to high accuracy with realistic effort.
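To make the notions of a string distance measure and a scoring function concrete, the following is a minimal sketch and not the correction model used by the tool. It assumes a unit-cost Levenshtein distance, a background dictionary given as a word-to-frequency mapping, and a hypothetical score that linearly mixes length-normalized similarity with relative word frequency. The names `rank_candidates`, `max_distance`, and `alpha` are illustrative assumptions; they stand in for the kind of parameters that would have to be tuned for each recognition task.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_candidates(token: str,
                    dictionary: Dict[str, int],
                    max_distance: int = 2,
                    alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Rank dictionary words as correction candidates for an OCR token.

    Assumed score (illustrative only): alpha * similarity
    + (1 - alpha) * relative frequency, where similarity is
    1 - distance / length of the longer string.
    """
    total = sum(dictionary.values()) or 1  # avoid division by zero
    scored = []
    for word, freq in dictionary.items():
        d = levenshtein(token, word)
        if d > max_distance:
            continue  # too far away to count as a candidate
        similarity = 1.0 - d / max(len(token), len(word), 1)
        score = alpha * similarity + (1.0 - alpha) * (freq / total)
        scored.append((word, score))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    # Toy background dictionary with made-up frequencies.
    toy_dictionary = {"model": 50, "modal": 10, "node": 40}
    for word, score in rank_candidates("rnodel", toy_dictionary):
        print(f"{word}\t{score:.3f}")
```

Changing `alpha` or `max_distance` alters which candidates survive and how they are ordered, which illustrates why such parameters are hard to set by hand without ground truth data and error analysis.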
