An Unsupervised and Data-Driven Approach for Spell Checking in Vietnamese OCR-scanned Texts

OCR (Optical Character Recognition) scanners do not always recognize text documents with 100% accuracy, and the resulting spelling errors make the texts hard to process further. This paper investigates the task of spell checking for OCR-scanned text documents. First, we conduct a detailed analysis of the characteristics of spelling errors produced by an OCR scanner. Then, we propose a fully automatic approach that combines the error detection and error correction phases within a single scheme. The scheme is designed in an unsupervised and data-driven manner, which makes it suitable for resource-poor languages. In an evaluation on a real Vietnamese dataset, our approach achieves acceptable performance (detection accuracy 86%, correction accuracy 71%). In addition, we analyze the results to show how accurate our approach can be.
