Empirical studies in strategies for Arabic retrieval

This work evaluates a few search strategies for Arabic monolingual and cross-lingual retrieval, using the TREC Arabic corpus as the test-bed. The release by NIST in 2001 of an Arabic corpus of nearly 400k documents with both monolingual and cross-lingual queries and relevance judgments has been a new enabler for empirical studies. Experimental results show that spelling normalization and stemming can significantly improve Arabic monolingual retrieval. Character tri-grams from stems improved retrieval modestly on the test corpus, but the improvement is not statistically significant. To further improve retrieval, we propose a novel thesaurus-based technique. Different from existing approaches to thesaurus-based retrieval, ours formulates word synonyms as probabilistic term translations that can be automatically derived from a parallel corpus. Retrieval results show that the thesaurus can significantly improve Arabic monolingual retrieval. For cross-lingual retrieval (CLIR), we found that spelling normalization and stemming have little impact.

[1]  Kui-Lam Kwok,et al.  TREC-9 Cross Language, Web and Question-Answering Track Experiments using PIRCS , 2000, TREC.

[2]  John D. Lafferty,et al.  Information retrieval as statistical translation , 1999, SIGIR '99.

[3]  Hermann Ney,et al.  Improved Statistical Alignment Models , 2000, ACL.

[4]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[5]  Donna K. Harman,et al.  How effective is suffixing? , 1991, J. Am. Soc. Inf. Sci..

[6]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[7]  Richard M. Schwartz,et al.  A hidden Markov model information retrieval system , 1999, SIGIR '99.

[8]  Ophir Frieder,et al.  IIT at TREC-10 , 2001, TREC.

[9]  Christine D. Piatko,et al.  JHU/APL at TREC 2001: Experiments in Filtering and in Arabic, Video, and Web Retrieval , 2001, TREC.

[10]  Djoerd Hiemstra,et al.  Disambiguation Strategies for Cross-Language Information Retrieval , 1999, ECDL.

[11]  W. Bruce Croft,et al.  An Association Thesaurus for Information Retrieval , 1994, RIAO.

[12]  Jean Paul Ballerini,et al.  Experiments in multilingual information retrieval using the SPIDER system , 1996, SIGIR '96.

[13]  Jinxi Xu,et al.  Evaluating a probabilistic model for cross-lingual information retrieval , 2001, SIGIR '01.

[14]  J. Scott McCarley Should we Translate the Documents or the Queries in Cross-language Information Retrieval? , 1999, ACL.

[15]  David J. Goodman,et al.  Personal Communications , 1994, Mobile Communications.

[16]  Chris Buckley,et al.  New Retrieval Approaches Using SMART: TREC 4 , 1995, TREC.

[17]  Hinrich Schütze,et al.  A Cooccurrence-Based Thesaurus and Two Applications to Information Retrieval , 1994, Inf. Process. Manag..

[18]  Kenneth R. Beesley Arabic Finite-State Morphological Analysis and Generation , 1996, COLING.

[19]  Fredric C. Gey,et al.  The TREC-2001 Cross-Language Information Retrieval Track: Searching Arabic Using English, French or Arabic Queries , 2001, TREC.

[20]  David A. Hull Using statistical testing in the evaluation of retrieval experiments , 1993, SIGIR.

[21]  Karen Sparck Jones Automatic keyword classification for information retrieval , 1971 .

[22]  Jian-Yun Nie,et al.  Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web , 1999, SIGIR '99.

[23]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[24]  Martha W. Evens,et al.  Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System , 1999, J. Am. Soc. Inf. Sci..