Semantic Embedding for Information Retrieval

Capturing semantics in a computable way is desirable for many applications, such as information retrieval and document clustering or classification. Embedding words or documents in a vector space is a common first step. Different embedding techniques have different characteristics, which makes it difficult to choose one for a given application. In this paper, we compare a few off-the-shelf word and document embedding methods with our own Ariadne approach in several evaluation tests. We argue that the specific requirements of an application must be taken into account when deciding which embedding method is most suitable. In addition, to achieve better retrieval performance, it is worth investigating whether combining bibliometric measures with semantic embeddings improves ranking.
