Context-Dependent Confusion Rules for Building an Error Model Using Weighted Finite State Transducers for OCR Post-Processing

In this paper, we propose a new technique to correct OCR errors by means of weighted finite state transducers (WFSTs) with context-dependent confusion rules. We translate the OCR confusions that appear in the recognition output into edit operations, i.e. insertions, deletions, and substitutions, using the Levenshtein edit distance algorithm. The edit operations are extracted in the form of rules that take into account the context of the incorrect string, and these rules are used to build an error model with weighted finite state transducers. The context-dependent rules ensure that each rule is applied only to the appropriate strings. Our error model avoids the costly search over the language model and also enables the language model to correct erroneous words by using context-dependent confusion rules. The approach is language independent, is designed to handle a varying number of errors, and places no limit on word length. In experiments conducted on OCRed pages from the UW-III dataset, the proposed error model outperforms both the baseline and previous approaches: its error rate on the UW-III test set is 0.68%, compared with 1.14% for the baseline and 1.0% for the existing state-of-the-art single-character rule-based approach.
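To illustrate the idea, the following is a minimal sketch (not taken from the paper) of how confusion rules with left and right context might be extracted from a pair of OCR-output and ground-truth strings via a Levenshtein alignment. All function names, the rule format, and the context width are assumptions made for illustration only; the paper's own rule extraction and WFST construction may differ.

```python
# Sketch (assumption, not the paper's implementation): derive context-dependent
# confusion rules from an (OCR output, ground truth) pair by backtracking a
# Levenshtein alignment. Rule format: (left_context, wrong, right_context, correct).

def levenshtein_ops(ocr: str, truth: str):
    """Return edit operations (op, position_in_ocr, correct_char) aligning ocr to truth."""
    n, m = len(ocr), len(truth)
    # dp[i][j] = minimal edit cost of aligning ocr[:i] with truth[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Backtrace to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ocr[i - 1] != truth[j - 1]):
            if ocr[i - 1] != truth[j - 1]:
                ops.append(("sub", i - 1, truth[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", i - 1, ""))
            i -= 1
        else:
            ops.append(("ins", i, truth[j - 1]))
            j -= 1
    return list(reversed(ops))

def extract_rules(ocr: str, truth: str, context: int = 1):
    """Turn edit operations into context-dependent confusion rules."""
    rules = []
    for op, pos, correct in levenshtein_ops(ocr, truth):
        left = ocr[max(0, pos - context):pos]
        wrong = ocr[pos] if op in ("sub", "del") else ""
        right = ocr[pos + 1:pos + 1 + context] if op != "ins" else ocr[pos:pos + context]
        rules.append((left, wrong, right, correct))
    return rules

# Example: the classic 'rn' -> 'm' OCR confusion is captured as a deletion
# plus a substitution, each with its surrounding context:
print(extract_rules("modern", "modem"))
# [('e', 'r', 'n', ''), ('r', 'n', '', 'm')]
```

Rules extracted this way could then be compiled, together with their weights, into arcs of a WFST error model (e.g. with the OpenFst library cited by the paper) and composed with the language model during decoding.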
