Language variation as a context for information retrieval

Speakers of a widely spoken language may encounter problems in information retrieval and document understanding when they access documents written in that language in another country. The work described here focuses on developing resources that support improved document retrieval and understanding for users of Modern Standard Arabic (MSA). The lexicon of an Egyptian Arabic speaker and that of an Algerian Arabic speaker overlap, but many lexical tokens are not shared, or mean different things to the two speakers. These differences provide a context for information retrieval that can improve retrieval performance and also enhance document understanding after retrieval. The availability of a suitable corpus is key to much objective research. In this paper we present the results of experiments in building an MSA corpus from data available on the World Wide Web. We selected samples of newspapers published online in different Arab countries. We demonstrate the completeness and representativeness of this corpus using standard metrics and show its suitability for language engineering experiments. The results of the experiments show that it is possible to link an Arabic document to a specific region based on information induced from its vocabu-
