Paper information - An improved root extraction technique for Arabic words

An improved root extraction technique for Arabic words

Arabic text interpretation depends, among other things, on a pre-processing stage that effectively extracts the correct stem or root. In this work, we address a linguistic approach to root extraction as a pre-processing step for Arabic text mining. The linguistic approach is composed of a rule-based light stemmer and a pattern-based infix remover. Because this approach does not handle weak, eliminated-long-vowel, hamzated, and geminated words, and a reasonably large portion of Arabic words in texts are irregular, we propose an algorithm to handle such cases. The accuracy of the extracted roots is determined by comparing them with a predefined list of 5,405 triliteral and quadriliteral roots. The performance of the linguistic approach (with and without the proposed correction algorithm) was tested on an in-house text collection of eight categories. The proposed correction algorithm improved the accuracy of the linguistic approach by about 14%.
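As a rough illustration of the pipeline the abstract describes (light stemming, pattern-based infix removal, then a check against a root list), a minimal Python sketch might look like the following. The affix lists, patterns, and root set here are small hypothetical samples chosen for the example; the abstract does not give the paper's actual rules or its 5,405-root list, and the proposed correction algorithm for irregular words is not reproduced.

```python
# Hypothetical, heavily abbreviated resources (assumptions, not the paper's data).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ة", "ي"]
# Morphological patterns: the letters ف ع ل stand for the three root consonants,
# every other letter is an affix or infix that must match literally.
PATTERNS = ["فاعل", "مفعول", "فعال", "فعيل", "مفعل", "فعول"]
KNOWN_ROOTS = {"كتب", "درس", "علم", "قول"}


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def match_pattern(stem, pattern):
    """Return the triliteral root if the stem fits the pattern, else None."""
    if len(stem) != len(pattern):
        return None
    slots = {}
    for stem_char, pattern_char in zip(stem, pattern):
        if pattern_char in "فعل":
            slots[pattern_char] = stem_char   # root consonant position
        elif pattern_char != stem_char:
            return None                       # literal infix/affix mismatch
    return slots["ف"] + slots["ع"] + slots["ل"]


def extract_root(word):
    """Light-stem the word, then remove infixes by pattern matching."""
    stem = light_stem(word)
    if len(stem) == 3:
        return stem
    for pattern in PATTERNS:
        root = match_pattern(stem, pattern)
        if root is not None:
            return root
    return stem  # no pattern applied; return the stem unchanged


def accuracy(words):
    """Fraction of words whose extracted root appears in the root list."""
    hits = sum(1 for w in words if extract_root(w) in KNOWN_ROOTS)
    return hits / len(words)


sample = ["الكاتب", "مكتوب", "والمدرسة", "قال"]
for w in sample:
    print(w, "->", extract_root(w))
print("accuracy:", accuracy(sample))
```

In this sketch the weak (hollow) verb قال is left as قال rather than mapped to its root قول, so it fails the root-list check. This is the kind of irregular case the abstract says the purely linguistic approach misses and the proposed correction algorithm is meant to handle.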

May Y. Al-Nashashibi | D. Neagu | Ali A. Yaghi
