Two bigrams based language model for auto correction of Arabic OCR errors
In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a language model based on two bigrams that uses Wikipedia's database is presented. The method can automatically detect and correct non-word errors in Arabic OCR text, and automatically detect real-word errors. The method consists of two parts: extracting context information from Wikipedia's database, and implementing the automatic detection and correction of incorrect words. This method can be applied to any language with little modification. The experimental results show successful extraction of context information from Wikipedia's articles. Furthermore, they show that using this method can reduce the error rate of Arabic OCR text.
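The two-part pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not define the "two bigrams", so the sketch assumes they are the pair formed with the preceding word and the pair formed with the following word. All function names, the edit-distance threshold, and the toy corpus (standing in for counts extracted from Wikipedia) are hypothetical; this is not the authors' implementation. The code operates on any Unicode tokens, so Arabic words can be substituted directly.

```python
from collections import Counter


def build_counts(sentences):
    """Part 1 (assumed): collect word and adjacent-pair counts from a corpus."""
    vocab, bigrams = Counter(), Counter()
    for words in sentences:
        vocab.update(words)
        bigrams.update(zip(words, words[1:]))
    return vocab, bigrams


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def context_score(word, left, right, bigrams):
    """Sum of the left-bigram and right-bigram counts (the assumed 'two bigrams')."""
    return bigrams[(left, word)] + bigrams[(word, right)]


def correct_non_words(words, vocab, bigrams, max_dist=1):
    """Part 2 (assumed): replace out-of-vocabulary words with the best-fitting candidate."""
    out = []
    for i, w in enumerate(words):
        if w in vocab:
            out.append(w)
            continue
        left = out[-1] if out else None  # already-corrected left neighbour
        right = words[i + 1] if i + 1 < len(words) else None
        candidates = [c for c in vocab if edit_distance(w, c) <= max_dist]
        if candidates:
            # Rank by context fit first, then by corpus frequency.
            w = max(candidates, key=lambda c: (context_score(c, left, right, bigrams), vocab[c]))
        out.append(w)
    return out


def flag_real_word_errors(words, vocab, bigrams):
    """Flag in-vocabulary words whose left and right bigrams were never observed."""
    flagged = []
    for i, w in enumerate(words):
        if w not in vocab:
            continue
        left = words[i - 1] if i > 0 else None
        right = words[i + 1] if i + 1 < len(words) else None
        if context_score(w, left, right, bigrams) == 0:
            flagged.append(i)
    return flagged


# Toy corpus; placeholder tokens stand in for Wikipedia article text.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "ate", "the", "fish"],
    ["a", "car", "drove", "on", "the", "road"],
]
vocab, bigrams = build_counts(corpus)
print(correct_non_words(["the", "caz", "sat", "on", "the", "mat"], vocab, bigrams))
print(flag_real_word_errors(["on", "the", "car", "sat", "on", "the", "mat"], vocab, bigrams))
```

In this sketch a real-word error is only flagged, not corrected, which mirrors the scope stated in the abstract. Note that a word at a sentence boundary has only one bigram available, so a wrong neighbour can cause it to be flagged as well; a practical system would likely need a smoothed probability and a tuned threshold rather than a zero-count test.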