Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content

This paper proposes a new method of supplementing the context of short sentences in the training phase of Doc2Vec. As CGM (Consumer Generated Media) and SNS sites have become widespread, calculating the similarity between a given query and a short sentence has become increasingly important. For example, on a movie review site, a search for the query "sad" should find actual expressions such as "I needed a handkerchief". Doc2Vec is one of the most widely used methods for vectorizing queries and sentences. However, Doc2Vec often exhibits low accuracy when the training data consists of short sentences, because they lack context. We modified Doc2Vec based on the hypothesis that other posts on the same topic (e.g., reviews of the same movie on an online movie review site) may share the same background. Our method uses target-topic IDs instead of sentence IDs as the context in the training phase of Doc2Vec with the PV-DM model, which estimates the next term from a few previous terms and the context. A model trained with target-topic IDs vectorizes a sentence more accurately than a model trained with sentence IDs. We conducted a large-scale experiment using 1.2 million movie review posts and a crowdsourcing-based evaluation. The experimental results demonstrate that our method achieves higher precision and nDCG than previous Doc2Vec variants and traditional topic modeling methods.
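The difference between the two kinds of context can be illustrated with a minimal sketch of how PV-DM-style training examples might be built. This is not the authors' implementation: the toy posts, the function names `pvdm_examples` and `build_examples`, and the window size are all hypothetical, and only the data preparation is shown, not the neural model itself. Each example pairs a context tag and a few previous terms with the next term to be predicted.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical toy data: (sentence_id, movie_id, tokens). Not from the paper.
Post = Tuple[str, str, List[str]]
Example = Tuple[str, Tuple[str, ...], str]  # (context tag, previous terms, next term)

posts: List[Post] = [
    ("s1", "movie_A", ["i", "needed", "a", "handkerchief"]),
    ("s2", "movie_A", ["so", "sad", "i", "cried"]),
    ("s3", "movie_B", ["great", "car", "chase"]),
]


def pvdm_examples(tokens: Sequence[str], tag: str, window: int = 2) -> Iterator[Example]:
    """Yield (tag, previous terms, next term) triples for one sentence.

    The tag plays the role of the paragraph-level context in PV-DM;
    `window` (an assumed value) is how many previous terms are used.
    """
    for i in range(1, len(tokens)):
        previous = tuple(tokens[max(0, i - window):i])
        yield tag, previous, tokens[i]


def build_examples(posts: Sequence[Post], use_topic_id: bool, window: int = 2) -> List[Example]:
    """Build training examples tagged by sentence ID or by target-topic ID."""
    examples: List[Example] = []
    for sentence_id, topic_id, tokens in posts:
        tag = topic_id if use_topic_id else sentence_id
        examples.extend(pvdm_examples(tokens, tag, window))
    return examples


baseline = build_examples(posts, use_topic_id=False)  # one context per sentence
proposed = build_examples(posts, use_topic_id=True)   # one context per movie

# Count how many examples contribute to each context tag.
for name, examples in (("sentence IDs", baseline), ("target-topic IDs", proposed)):
    counts = {}
    for tag, _, _ in examples:
        counts[tag] = counts.get(tag, 0) + 1
    print(name, counts)
```

With sentence IDs, each context vector would be learned from a single short post; with target-topic IDs, all posts about the same movie contribute to one shared context vector, which is the effect the abstract relies on. In a library such as gensim, a comparable setup would presumably pass the movie ID rather than a per-sentence ID as the tag of each `TaggedDocument`, though that mapping is an assumption and not something the abstract states.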
