Cross-language Information Retrieval with Explicit Semantic Analysis

We participated in the monolingual and bilingual CLEF Ad-Hoc Retrieval Tasks using a novel extension of the well-known Explicit Semantic Analysis (ESA) approach. We call this extension Cross-Language Explicit Semantic Analysis (CL-ESA), as it allows ESA to be applied in a cross-lingual information retrieval setting. In essence, ESA represents documents as vectors in the space of Wikipedia articles, using the tf-idf measure to capture how "important" a Wikipedia article is for a specific word. The interesting property of ESA is that arbitrary documents can be represented as vectors with respect to the Wikipedia article space. ESA thus replaces the standard bag-of-words (BOW) model for retrieval. In our cross-lingual extension of ESA, the cross-language links of Wikipedia are used to map ESA vectors between languages, which allows retrieval across languages. Our results are far behind those of other systems on these tasks, but our motivation was to explore the potential of the CL-ESA approach with a first, unoptimized implementation.
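The idea described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the tiny English and German "Wikipedia" articles, the link table, the smoothed tf-idf weighting, and all function names are assumptions made for illustration only. A query is turned into an ESA vector over English articles, mapped through the cross-language links to German articles, and compared by cosine similarity with the ESA vectors of German documents.

```python
import math
from collections import Counter

# Hypothetical toy "Wikipedia" in two languages (illustrative data only).
EN_ARTICLES = {
    "en:Dog": "dog animal pet bark",
    "en:Car": "car vehicle road engine",
    "en:Music": "music song instrument sound",
}
DE_ARTICLES = {
    "de:Hund": "hund tier haustier bellen",
    "de:Auto": "auto fahrzeug strasse motor",
    "de:Musik": "musik lied instrument klang",
}
# Hypothetical cross-language links from English to German articles.
LANG_LINKS = {"en:Dog": "de:Hund", "en:Car": "de:Auto", "en:Music": "de:Musik"}


def build_index(articles):
    """Map each word to {article: tf-idf weight} (smoothed idf, an assumption)."""
    n = len(articles)
    tfs = {a: Counter(text.split()) for a, text in articles.items()}
    df = Counter(w for tf in tfs.values() for w in tf)
    index = {}
    for article, tf in tfs.items():
        for word, count in tf.items():
            index.setdefault(word, {})[article] = count * math.log(1 + n / df[word])
    return index


def esa_vector(text, index):
    """Represent a text as a sparse vector over Wikipedia articles."""
    vec = Counter()
    for word in text.lower().split():
        for article, weight in index.get(word, {}).items():
            vec[article] += weight
    return dict(vec)


def map_vector(vec, links):
    """Map an ESA vector to another language via cross-language links."""
    return {links[a]: w for a, w in vec.items() if a in links}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(a, 0.0) for a, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


EN_INDEX = build_index(EN_ARTICLES)
DE_INDEX = build_index(DE_ARTICLES)


def retrieve(query_en, documents_de):
    """Rank German documents for an English query in the mapped ESA space."""
    query_vec = map_vector(esa_vector(query_en, EN_INDEX), LANG_LINKS)
    scored = [
        (cosine(query_vec, esa_vector(text, DE_INDEX)), doc_id)
        for doc_id, text in documents_de.items()
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = {"d1": "der hund bellen", "d2": "auto motor"}
    for score, doc_id in retrieve("dog bark", docs):
        print(doc_id, round(score, 3))
```

In this sketch the query and the documents never share a word; they are matched only because the articles they activate are connected by a cross-language link. A real system would build the indexes from full Wikipedia dumps and would need to handle articles that have no counterpart in the other language.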
