A Theoretical Analysis of Cross-lingual Semantic Relatedness in Vector Space Models

Semantic relatedness is essential for many text processing tasks, especially in the cross-lingual setting, where vocabulary mismatch is a central problem. Many concept-based approaches to semantic relatedness have been proposed; they differ in their notion of concept and in how documents are represented. We provide a unified model that generalizes over the existing approaches to cross-lingual semantic relatedness. The model shows that the main existing solutions represent different ways of constructing the concept space, which lead to different document representations and have different implications for computing semantic relatedness. In particular, it allows us to provide theoretical justifications for existing solutions. Our experimental evaluation supports these theoretical findings.
