Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System

In this paper, we describe a new approach to the post-processing step of an OCR system. It relies on a spelling correction method that automatically corrects misspelled words produced by character recognition of scanned documents. The method combines ontologies with bigram codes to build a robust system that automatically resolves the anomalies of classical approaches. The proposed approach is hybrid and proceeds in two stages: the first performs character recognition using an ontological model, and the second performs word recognition using spelling correction based on bigram codification to detect and correct errors. Spelling errors are broadly classified into two categories: non-word errors and real-word errors. In this paper, we are interested only in the detection and correction of non-word errors, because this is the only type of error treated by an OCR system. In addition, the use of an online external resource such as WordNet proves necessary to improve the system's performance.
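To make the second stage more concrete, the following is a minimal sketch of bigram-based detection and correction of non-word errors. It is not the authors' implementation: the abstract does not specify the bigram codification, so the sketch assumes character bigrams compared with the Dice coefficient. The names `bigrams`, `dice`, `correct`, the toy `LEXICON`, and the `threshold` value are all hypothetical, and a real system would draw its lexicon from a resource such as WordNet.

```python
# Illustrative sketch only: character-bigram similarity is an assumption,
# not the bigram codification used in the paper.

# Toy lexicon standing in for a real dictionary resource (hypothetical).
LEXICON = {"character", "correction", "ontology", "recognition", "word"}


def bigrams(word):
    """Return the set of adjacent character pairs in a lowercased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two words (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def correct(word, lexicon, threshold=0.5):
    """Return the word unchanged if it is in the lexicon (no non-word error).

    Otherwise return the most similar lexicon entry, provided its Dice score
    reaches the threshold; if no entry is close enough, return the word as is.
    """
    if word.lower() in lexicon:
        return word
    # Sort the candidates so that ties are broken deterministically.
    best = max(sorted(lexicon), key=lambda cand: dice(word, cand))
    return best if dice(word, best) >= threshold else word


if __name__ == "__main__":
    # "charactcr" mimics a typical OCR confusion between "e" and "c".
    print(correct("charactcr", LEXICON))
```

Detection here is a plain lexicon lookup, which is why the sketch only handles non-word errors: a real-word error such as "form" for "from" passes the lookup and would need context to be caught.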
