Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology

This paper explores the use of a character-segment-based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experiments show that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology has a small adverse effect. Further, given a sufficiently large corpus from which to extract a dictionary and train a language model, word-based correction works well for a morphologically rich language such as Arabic.
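To make the combination of a character-segment correction model with a language model concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It scores each dictionary word by a channel probability computed over segments of one or two characters, multiplied by a unigram probability. The confusion table, the match probability, the unigram counts, and the function names `channel_prob` and `rank` are all hypothetical, and Latin-script toy words are used only for readability.

```python
# Hypothetical confusion table: P(observed OCR segment | true segment).
# Keys are (observed, true); segments are 1 or 2 characters long.
# All probabilities are invented for illustration.
CONFUSIONS = {
    ("rn", "m"): 0.2,  # a true "m" read as "rn"
    ("c", "e"): 0.1,   # a true "e" read as "c"
}
MATCH_PROB = 0.9  # assumed probability that a character is read correctly

# Hypothetical unigram counts standing in for a trained language model.
UNIGRAMS = {"modem": 50, "modern": 200, "model": 120}


def channel_prob(observed, truth, max_seg=2):
    """Probability of the best segment alignment of observed given truth."""
    n, m = len(observed), len(truth)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == 0.0:
                continue
            # Try consuming a segment of length a from observed
            # and a segment of length b from truth.
            for a in range(1, max_seg + 1):
                for b in range(1, max_seg + 1):
                    if i + a > n or j + b > m:
                        continue
                    obs_seg = observed[i:i + a]
                    true_seg = truth[j:j + b]
                    if a == 1 and b == 1 and obs_seg == true_seg:
                        p = MATCH_PROB
                    else:
                        p = CONFUSIONS.get((obs_seg, true_seg), 0.0)
                    if p > 0.0:
                        cand = best[i][j] * p
                        if cand > best[i + a][j + b]:
                            best[i + a][j + b] = cand
    return best[n][m]


def rank(observed, max_seg=2):
    """Rank dictionary words by channel probability times unigram probability."""
    total = sum(UNIGRAMS.values())
    scored = []
    for word, count in UNIGRAMS.items():
        score = channel_prob(observed, word, max_seg) * count / total
        if score > 0.0:
            scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(rank("rnodel"))             # segment model can map "rn" back to "m"
    print(rank("rnodel", max_seg=1))  # single-character model finds no candidate
    print(rank("modern"))             # language model orders competing candidates
```

In this toy setup, limiting segments to one character (`max_seg=1`) leaves the `rn`/`m` confusion unreachable, which illustrates why a segment-based model can cover errors that a single-character model cannot. The unigram term then decides the order when more than one dictionary word explains the same OCR string.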
