Character-Level Alignment Using WFST and LSTM for Post-processing in Multi-script Recognition Systems - A Comparative Study

In this paper, two new techniques for correcting OCR errors are proposed: recurrent neural networks with Long Short-Term Memory (LSTM), and Weighted Finite State Transducers (WFSTs) with context-dependent confusion rules. Both methods are applied to OCR results for Latin and Urdu script; Urdu script in particular is very challenging to OCR. To build an error model from context-dependent confusion rules, the OCR confusions that appear in the recognition outputs are translated into edit operations using the Levenshtein edit distance algorithm. The new LSTM model avoids the calculations involved in searching the language model, and it also enables the correction of incorrect words that were not seen before. Both approaches are generic and language independent. The proposed supervised LSTM model is compared with the context-dependent error model and with state-of-the-art single rule-based methods. On Latin script, the error rate of the LSTM is 0.48 %, that of the error model is 0.68 %, and that of the rule-based model is 1.0 %. On the Urdu test set, the error rate of the LSTM model is 1.58 %, compared with 3.8 % for the error model and 6.9 % for the raw OCR recognition results. The LSTM showed the best performance on both Latin and Urdu script. These experiments show that LSTM performs very well in language-related tasks, especially post-processing.
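As an illustration of how OCR confusions can be turned into edit operations with the Levenshtein algorithm, the following is a minimal sketch and not the authors' implementation. The function names `edit_operations` and `confusion_rules`, the one-character context window on each side, and the boundary markers `^` and `$` are all assumptions made for this example; the abstract does not specify how much context the confusion rules use or how they are weighted in the WFST.

```python
from collections import Counter


def edit_operations(ocr, truth):
    """Align ocr with truth using Levenshtein dynamic programming.

    Returns (distance, ops), where each op is a tuple
    (kind, position_in_ocr, ocr_char, truth_char).
    """
    n, m = len(ocr), len(truth)
    # dist[i][j] = edit distance between ocr[:i] and truth[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete ocr[i-1]
                dist[i][j - 1] + 1,         # insert truth[j-1]
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("sub", i - 1, ocr[i - 1], truth[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", i - 1, ocr[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", i, "", truth[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


def confusion_rules(pairs):
    """Count context-dependent confusions over (ocr, truth) string pairs.

    Each rule key is (left_context, ocr_char, right_context, truth_char).
    The single-character context on each side is an assumption of this sketch.
    """
    rules = Counter()
    for ocr, truth in pairs:
        _, ops = edit_operations(ocr, truth)
        for kind, pos, wrong, correct in ops:
            left = ocr[pos - 1] if pos > 0 else "^"
            # An insertion consumes no OCR character, so its right
            # context starts at pos rather than pos + 1.
            nxt = pos if kind == "ins" else pos + 1
            right = ocr[nxt] if nxt < len(ocr) else "$"
            rules[(left, wrong, right, correct)] += 1
    return rules
```

For instance, calling `confusion_rules` on pairs such as `("tbe cat", "the cat")` yields a count for the rule "`b` between `t` and `e` should be `h`". In an error-model setting, counts like these could be normalized into weights for transducer arcs, although that weighting scheme is likewise an assumption here.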
