Adapting Morphology for Arabic Information Retrieval

This chapter adapts existing techniques in Arabic morphology, using corpus statistics, to make them suitable for Information Retrieval (IR). The adaptation resulted in the development of Sebawai, a shallow Arabic morphological analyzer, and Al-Stem, an Arabic light stemmer. Both were used to produce Arabic index terms for Arabic IR experimentation. Sebawai generates the possible roots and stems of a given Arabic word, along with probability estimates of deriving the word from each of the possible roots. The probability estimates were used as a guide to determine which prefixes and suffixes should be used to build the light stemmer Al-Stem. The Sebawai-generated roots and stems and the stems from Al-Stem are each evaluated as index terms in an information retrieval application, and the results are compared.
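The idea of letting probability estimates decide which affixes a light stemmer removes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `select_affixes` and `light_stem`, the probability values, the threshold, the minimum stem length, and the short affix lists are hypothetical and are not taken from Sebawai or Al-Stem.

```python
def select_affixes(affix_probs, threshold):
    """Keep affixes whose estimated probability meets the threshold.

    The result is ordered longest first so that a longer affix is tried
    before any shorter affix it contains.
    """
    kept = (affix for affix, prob in affix_probs.items() if prob >= threshold)
    return sorted(kept, key=len, reverse=True)


def light_stem(word, prefixes, suffixes, min_stem_len=2):
    """Strip at most one prefix and one suffix, keeping a minimum stem length."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
            word = word[len(prefix):]
            break
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[:-len(suffix)]
            break
    return word


# Hypothetical probability estimates (illustrative values only).
prefix_probs = {"ال": 0.9, "وال": 0.8, "ك": 0.1}
suffix_probs = {"ات": 0.7, "ها": 0.6, "ا": 0.2}

prefixes = select_affixes(prefix_probs, threshold=0.5)
suffixes = select_affixes(suffix_probs, threshold=0.5)

stem_a = light_stem("والكتاب", prefixes, suffixes)
stem_b = light_stem("المعلمات", prefixes, suffixes)
```

In this sketch, low-probability affixes are never stripped, which is one plausible way to keep a light stemmer from removing letters that usually belong to the stem.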
