Paper Information - Arabic Corpus Enhancement using a New Lexicon/Stemming Algorithm

Optical Character Recognition (OCR) is an important technology with many advantages for storing information from both old and new documents. Compared with Roman scripts, Arabic has fewer OCR systems and a shallower body of research. An authoritative corpus is beneficial in the design and construction of any OCR system, and lexicon and stemming tools are essential for enhancing corpus retrieval and performance in an OCR context. A new lexicon/stemming algorithm is presented, based on the Viterbi path method and using a light stemmer approach. Lexicon and stemming lookups are combined to obtain a list of alternatives for uncertain words. The list is built by removing affixes (prefixes or suffixes) from the word if it has any; otherwise affixes are added. Finally, every word in the list of alternatives is verified by searching the original corpus. The lexicon/stemming algorithm also supports continuous updating of the contents of the corpus presented by (AbdelRaouf et al., 2010), so that the corpus keeps pace with the changing needs of Arabic OCR.
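The alternatives step described in the abstract (strip affixes if the word has any, otherwise add them, then keep only candidates found in the corpus) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the affix lists, the function names, and the Latin transliteration of the example words are all assumptions made here for readability, and the Viterbi path component mentioned in the abstract is not modelled.

```python
# Hypothetical affix lists, transliterated into Latin letters for readability.
# The abstract does not list the affixes the algorithm actually uses.
PREFIXES = ("al", "wa", "bi")
SUFFIXES = ("at", "un", "ha")


def alternatives(word, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Build candidate forms for an uncertain word.

    If the word carries any known prefix or suffix, return the forms with
    one affix stripped; otherwise return the forms with one affix added.
    """
    stripped = set()
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            stripped.add(word[len(p):])
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            stripped.add(word[:-len(s)])
    if stripped:
        return sorted(stripped)
    # No affix found: generate candidates by adding each affix in turn.
    added = {p + word for p in prefixes} | {word + s for s in suffixes}
    return sorted(added)


def verified_alternatives(word, corpus):
    """Keep only the candidates that actually occur in the corpus."""
    return [c for c in alternatives(word) if c in corpus]


if __name__ == "__main__":
    # Toy corpus (an assumption for this example), stored as a set of word forms.
    corpus = {"kitab", "alkitab", "kitabha"}
    print(verified_alternatives("alkitab", corpus))
    print(verified_alternatives("kitab", corpus))
```

Storing the corpus as a set keeps the final verification step a constant-time membership test per candidate, which matters when every uncertain OCR word produces several alternatives.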

Tony P. Pridmore | Mahmoud I. Khalil | Colin Higgins | Ashraf AbdelRaouf

[1] Riyad Al-Shalabi, et al. Constructing An Automatic Lexicon for Arabic Language, 2005.

[2] Ophir Frieder, et al. On arabic search: improving the retrieval effectiveness via a light stemming approach, 2002, CIKM '02.

[3] Mahmoud I. Khalil, et al. A Database for Arabic Printed Character Recognition, 2008, ICIAR.

[4] H. Kucera, et al. Computational analysis of present-day American English, 1967.

[5] Lisa Ballesteros, et al. Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis, 2002, SIGIR '02.

[6] Kalina Bontcheva, et al. Architectural elements of language engineering robustness, 2002, Natural Language Engineering.

[7] 名倉秀人, et al. British National Corpus-XML Editionに関する一考察 [A Study of the British National Corpus-XML Edition], 2015.

[8] Tony P. Pridmore, et al. Building a multi-modal Arabic corpus (MMAC), 2010, International Journal on Document Analysis and Recognition (IJDAR).

[9] Yiming Yang, et al. Unsupervised Learning of Arabic Stemming Using a Parallel Corpus, 2003, ACL.

[10] Martha W. Evens, et al. Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System, 1994, J. Am. Soc. Inf. Sci.

[11] Riyad Al-Shalabi, et al. A Computational Morphology System for Arabic, 1998, SEMITIC@COLING.