Error Detection and Corrections in Indic OCR Using LSTMs

Conventional spell checkers suggest corrections by finding proximity-based matches in a known vocabulary. For highly inflectional Indian languages, any off-the-shelf vocabulary is significantly incomplete, since a large fraction of the words in Indic documents are formed by word-conjoining rules. Spell-correcting Indic OCR documents therefore requires tremendous manual effort. Moreover, a vocabulary may offer multiple alternatives for an incorrect word. Language models are normally used to improve the ranking of such suggestions, but owing to the scarcity of corpus resources, Indian languages lack reliable language models. Learning the character (or n-gram) confusions or error patterns of the OCR system can thus help in correcting the out-of-vocabulary (OOV) words in OCR documents. We adopt a Long Short-Term Memory (LSTM) based character-level language model with a fixed delay for discriminative language modeling in the context of OCR errors, jointly addressing the problems of error detection and correction in Indic OCR. For words in the OCR output that need no correction, our model simply abstains from suggesting any changes. We present extensive results validating the model on four Indian languages with different inflectional complexities. We achieve F-scores above 92.4% and reductions in Word Error Rate (WER) of at least 26.7% across the four languages.
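The conventional baseline described above can be illustrated with a minimal sketch. It is not the LSTM model of this paper; it only shows proximity-based matching against a vocabulary, using Levenshtein edit distance as an assumed proximity measure. The names `edit_distance`, `suggest`, `max_distance`, and `TOY_VOCABULARY` are hypothetical, and the toy English words stand in for Indic vocabulary entries.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=1):
    """Return vocabulary words within max_distance of word, nearest first.

    A word already in the vocabulary is treated as correct and gets no
    suggestions. The threshold max_distance is an illustrative assumption.
    """
    if word in vocabulary:
        return []
    scored = sorted((edit_distance(word, v), v) for v in vocabulary)
    return [v for d, v in scored if d <= max_distance]


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCABULARY = {"cat", "cot", "cart", "dog"}

print(suggest("cst", TOY_VOCABULARY))      # two equally close candidates
print(suggest("dog", TOY_VOCABULARY))      # in vocabulary: nothing to suggest
print(suggest("dogcart", TOY_VOCABULARY))  # conjoined word: no match found
```

The first call returns two candidates at the same distance, which is the ranking problem a language model would normally resolve. The last call shows the incompleteness problem: a word formed by joining two vocabulary entries is itself out of vocabulary, so it is flagged as an error yet receives no usable suggestion.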
