On Temporally Sensitive Word Embeddings for News Information Retrieval

Word embedding has become a central topic in recent natural language processing (NLP) and information retrieval (IR) research because of its potential to represent text at a semantic level. Current word embedding methods exploit term proximity relationships in a large corpus to generate a vector representation of a word in a semantic space. We argue that the semantic relationships among terms change over time, especially for news IR. When unusual and unprecedented events are reported in news articles, for example, the word co-occurrence statistics in the time period covering those events change non-trivially, affecting the semantic relationships of some words in the embedding space and hence news IR. Under the hypothesis that news IR would benefit from word embeddings that change over time, this paper reports our initial investigation along this line. We constructed a news retrieval collection based on mobile search and conducted a retrieval experiment comparing embeddings built from two sets of news articles covering two disjoint time spans. The collection comprises the 500 most frequent queries and their clicked news articles from July 2017, provided by Naver Corp. The experimental results show that word embeddings need to be built in a temporally sensitive way for news IR.

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org
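The core observation, that the co-occurrence statistics driving an embedding shift between time slices, can be illustrated with a minimal sketch. The toy corpora and the window-based counting below are illustrative assumptions, not the paper's actual data or training method; a real embedding model (e.g. word2vec) would be trained on each time slice, but the neighborhood shift already shows up at the raw co-occurrence level.

```python
from collections import Counter

def cooccurrence(docs, window=2):
    """Count within-window term co-occurrence pairs over tokenized documents."""
    counts = Counter()
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((w, v)))] += 1
    return counts

def neighbors(counts, word):
    """Terms that co-occur with `word` in the counted pairs."""
    return {a if b == word else b for (a, b) in counts if word in (a, b)}

# Hypothetical toy corpora for two disjoint time spans (illustrative only).
june_docs = [["apple", "fruit", "harvest"], ["apple", "orchard", "season"]]
july_docs = [["apple", "iphone", "launch"], ["apple", "keynote", "event"]]

june = cooccurrence(june_docs)
july = cooccurrence(july_docs)

# The same query term acquires different co-occurring terms in each slice,
# which is what moves its position in an embedding trained per slice.
june_neighbors = neighbors(june, "apple")
july_neighbors = neighbors(july, "apple")
```

An embedding trained only on the earlier slice would rank "apple" near agricultural terms and miss the event-driven sense that dominates the later slice, which is the mismatch the paper's retrieval experiment probes.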
