Design, Construction and Validation of an Arabic-English Conceptual Interlingua for Cross-lingual Information Retrieval

This paper describes the issues involved in extending a trans-lingual lexicon, the TextWise Conceptual Interlingua (CI), with Arabic terms. The CI is based on the Princeton English WordNet (Fellbaum, 1998) and is a central component of the cross-lingual information retrieval (CLIR) system CINDOR (Conceptual INterlingua for DOcument Retrieval). Arabic has a rich morphological system that combines templatic and affixational paradigms for both inflectional and derivational morphology. This rich morphology poses a major challenge to the design, construction, and validation of the Arabic CI, because the available resources for Arabic, whether manually constructed bilingual lexicons or lexicons automatically derived from bilingual parallel corpora, exist at different levels of morphological representation. We describe the issues encountered and the decisions made in designing and constructing the Arabic-English CI from different types of manual and automatic resources. We also present the results of an extensive validation of the Arabic CI and briefly discuss the evaluation of its use for CLIR on the TREC Arabic benchmark collection.

[1] Miguel E. Ruiz et al. CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation, 1999, TREC.

[2] Christiane Fellbaum et al. Book Reviews: WordNet: An Electronic Lexical Database, 1999, CL.

[3] Fredric C. Gey et al. The TREC-2001 Cross-Language Information Retrieval Track: Searching Arabic Using English, French or Arabic Queries, 2001, TREC.

[4] Miguel E. Ruiz et al. CINDOR TREC-9 English-Chinese Evaluation, 2000, TREC.

[5] Nizar Habash et al. Large Scale Lexeme Based Arabic Morphological Generation, 2004.

[6] Mona T. Diab. Feasibility of Bootstrapping an Arabic WordNet Leveraging Parallel Corpora and an English WordNet, 2022.

[7] Fredric C. Gey et al. The TREC 2002 Arabic/English CLIR Track, 2002, TREC.

[8] Ibrahim A. Al-Kharashi et al. Arabic morphological analysis techniques: A comprehensive survey, 2004, J. Assoc. Inf. Sci. Technol.

[9] Ali Farghaly et al. Roots & patterns vs. stems plus grammar-lexis specifications: on what basis should a multilingual database centred on Arabic be built?, 2003, MTSUMMIT.

[10] Hadj Ahmed Cherkaoui et al. A Computational Lexeme-Based Treatment of Arabic Morphology, 2001.

[11] Christiane Fellbaum et al. Introducing the Arabic WordNet project, 2006.

[12] Daniel Jurafsky et al. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks, 2004, NAACL.

[13] Hermann Ney et al. A Systematic Comparison of Various Statistical Alignment Models, 2003, CL.

[14] Wim Peters et al. The Multilingual design of the EuroWordNet Database, 1997.

[15] Wim Peters et al. Multilingual design of EuroWordNet, 1997, ACL 1997.

[16] Günter Neumann et al. Arabic Computational Morphology: Knowledge-based and Empirical Methods, 2007.

[17] Ophir Frieder et al. On arabic search: improving the retrieval effectiveness via a light stemming approach, 2002, CIKM '02.

[18] Nizar Habash et al. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop, 2005, ACL.