A Wikipedia-Based Multilingual Retrieval Model

This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document d written in language L, we construct a concept vector 𝐝 for d, where each dimension i of 𝐝 quantifies the similarity of d to a document d_i* chosen from the "L-subset" of Wikipedia. Likewise, for a second document d′ written in language L′, L ≠ L′, we construct a concept vector 𝐝′, using the topic-aligned counterparts d′_i* of the previously chosen documents from the L′-subset of Wikipedia. Since the two concept vectors 𝐝 and 𝐝′ are collection-relative representations of d and d′, they are language-independent; that is, their similarity can be computed directly, for instance with the cosine similarity measure. We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.
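The construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-article aligned index, the whitespace tokenizer, the plain term-frequency weights, and all function names are assumptions made for brevity. Each document is compared against the index articles of its own language, and the two resulting concept vectors are then compared with the cosine measure.

```python
import math
from collections import Counter

# Hypothetical toy index: INDEX_EN[i] and INDEX_DE[i] stand in for
# topic-aligned Wikipedia articles in the L-subset and the L'-subset.
INDEX_EN = [
    "football match goal team player",
    "bank money credit interest loan",
    "star planet galaxy telescope orbit",
]
INDEX_DE = [
    "fußball spiel tor mannschaft spieler",
    "bank geld kredit zinsen darlehen",
    "stern planet galaxie teleskop umlaufbahn",
]

def term_vector(text):
    """Term-frequency vector (assumption: lowercase, split on whitespace)."""
    return Counter(text.lower().split())

def sparse_cosine(u, v):
    """Cosine similarity of two sparse term vectors stored as dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def concept_vector(text, index_docs):
    """Dimension i holds the similarity of the text to index document i."""
    doc = term_vector(text)
    return [sparse_cosine(doc, term_vector(a)) for a in index_docs]

def dense_cosine(a, b):
    """Cosine similarity of two equal-length concept vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cl_esa_similarity(d, d_prime, index_l, index_l_prime):
    """Compare d (language L) with d_prime (language L') via concept vectors."""
    return dense_cosine(concept_vector(d, index_l),
                        concept_vector(d_prime, index_l_prime))

query_en = "the team scored a goal in the football match"
sport_de = "die mannschaft schoss ein tor im fußball spiel"
finance_de = "die bank erhöht die zinsen für den kredit"

print(cl_esa_similarity(query_en, sport_de, INDEX_EN, INDEX_DE))
print(cl_esa_similarity(query_en, finance_de, INDEX_EN, INDEX_DE))
```

In this toy setting the German sports text scores higher against the English query than the German finance text, although the two languages share no vocabulary. The comparison only ever happens between concept vectors, whose dimensions are aligned by topic rather than by term.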
