Empirical studies in strategies for Arabic retrieval

This work evaluates several search strategies for Arabic monolingual and cross-lingual retrieval, using the TREC Arabic corpus as the test bed. NIST's 2001 release of an Arabic corpus of nearly 400k documents, with both monolingual and cross-lingual queries and relevance judgments, has enabled new empirical studies. Experimental results show that spelling normalization and stemming can significantly improve Arabic monolingual retrieval. Character tri-grams from stems improved retrieval modestly on the test corpus, but the improvement is not statistically significant. To further improve retrieval, we propose a novel thesaurus-based technique. Unlike existing approaches to thesaurus-based retrieval, ours formulates word synonyms as probabilistic term translations that can be automatically derived from a parallel corpus. Retrieval results show that the thesaurus can significantly improve Arabic monolingual retrieval. For cross-lingual retrieval (CLIR), we found that spelling normalization and stemming have little impact.
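The abstract names three indexing ideas without giving their details: spelling normalization, character tri-grams taken from stems, and synonyms expressed as probabilistic term translations. The Python sketch below illustrates each one under stated assumptions. The normalization rules (folding hamza-carrying alef variants to bare alef, stripping diacritics and tatweel) are common choices, not rules confirmed by this page. The function `synonym_probs` assumes synonyms are obtained by pivoting through English translation probabilities, which is one plausible reading of "derived from a parallel corpus". All function and table names are hypothetical.

```python
import re

# ASSUMPTION: illustrative normalization rules, not taken from the paper.
# Fold alef-with-hamza/madda variants to bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Short-vowel diacritics (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")


def normalize(word):
    """Strip diacritics/tatweel, then fold alef variants."""
    return DIACRITICS.sub("", word).translate(ALEF_VARIANTS)


def char_trigrams(stem):
    """Overlapping character tri-grams of a stem; short stems are kept whole."""
    if len(stem) < 3:
        return [stem]
    return [stem[i:i + 3] for i in range(len(stem) - 2)]


def synonym_probs(p_e_given_a, p_a_given_e, term):
    """ASSUMPTION: pivot through English, p(a2 | term) = sum_e p(a2 | e) * p(e | term).

    Both tables are nested dicts of translation probabilities, as might be
    estimated by a word-alignment model on a parallel corpus.
    """
    scores = {}
    for e, p_e in p_e_given_a.get(term, {}).items():
        for a2, p_a in p_a_given_e.get(e, {}).items():
            scores[a2] = scores.get(a2, 0.0) + p_e * p_a
    return scores
```

In a retrieval pipeline along these lines, `normalize` would run before stemming, `char_trigrams` would be applied to each stem to produce index terms, and the output of `synonym_probs` would weight expansion terms for a query word.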

Alexander M. Fraser | Jinxi Xu | Ralph M. Weischedel

[1] Kui-Lam Kwok, et al. TREC-9 Cross Language, Web and Question-Answering Track Experiments using PIRCS, 2000, TREC.

[2] John D. Lafferty, et al. Information retrieval as statistical translation, 1999, SIGIR '99.

[3] Hermann Ney, et al. Improved Statistical Alignment Models, 2000, ACL.

[4] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[5] Donna K. Harman, et al. How effective is suffixing?, 1991, J. Am. Soc. Inf. Sci.

[6] Robert L. Mercer, et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, 1993, CL.

[7] Richard M. Schwartz, et al. A hidden Markov model information retrieval system, 1999, SIGIR '99.

[8] Ophir Frieder, et al. IIT at TREC-10, 2001, TREC.

[9] Christine D. Piatko, et al. JHU/APL at TREC 2001: Experiments in Filtering and in Arabic, Video, and Web Retrieval, 2001, TREC.

[10] Djoerd Hiemstra, et al. Disambiguation Strategies for Cross-Language Information Retrieval, 1999, ECDL.

[11] W. Bruce Croft, et al. An Association Thesaurus for Information Retrieval, 1994, RIAO.

[12] Jean Paul Ballerini, et al. Experiments in multilingual information retrieval using the SPIDER system, 1996, SIGIR '96.

[13] Jinxi Xu, et al. Evaluating a probabilistic model for cross-lingual information retrieval, 2001, SIGIR '01.

[14] J. Scott McCarley. Should we Translate the Documents or the Queries in Cross-language Information Retrieval?, 1999, ACL.

[15] David J. Goodman, et al. Personal Communications, 1994, Mobile Communications.

[16] Chris Buckley, et al. New Retrieval Approaches Using SMART: TREC 4, 1995, TREC.

[17] Hinrich Schütze, et al. A Cooccurrence-Based Thesaurus and Two Applications to Information Retrieval, 1994, Inf. Process. Manag.

[18] Kenneth R. Beesley. Arabic Finite-State Morphological Analysis and Generation, 1996, COLING.

[19] Fredric C. Gey, et al. The TREC-2001 Cross-Language Information Retrieval Track: Searching Arabic Using English, French or Arabic Queries, 2001, TREC.

[20] David A. Hull. Using statistical testing in the evaluation of retrieval experiments, 1993, SIGIR.

[21] Karen Sparck Jones. Automatic keyword classification for information retrieval, 1971.

[22] Jian-Yun Nie, et al. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web, 1999, SIGIR '99.

[23] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.

[24] Martha W. Evens, et al. Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System, 1999, J. Am. Soc. Inf. Sci.