Exploring a Topical Representation of Documents for Recommendation Systems

In this paper, we address the performance problems that arise when word embeddings are used for recommendation. Free-text documents follow no structural construction rules and are hard to model, so building an accurate model that conveys all the important information is nontrivial. We convert each document to a numeric structure using word embeddings and test two document representations: one based on the center of this numeric representation and the other based on a predefined set of topics. We build a free-text recommendation system and study how its performance, in terms of precision and recommendation time, is affected by each representation. We then vary the number of topics used to represent documents and examine the tradeoffs inherent in a compact representation: the more compact the representation, the shorter the recommendation time, but the more information is lost in the compaction process. We empirically test different topic configurations and find an optimal point that is 3 times faster than a baseline and almost as accurate.
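
To make the two representations concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that a document is given as a matrix of word vectors, that the "center" is the mean of those vectors, and that each topic is a point in the embedding space (for example, a cluster center) to which every word is assigned by nearest distance. The function names, the random toy data, and the choice of cosine similarity are all illustrative assumptions.

```python
import numpy as np


def centroid_representation(word_vectors):
    """Represent a document by the mean of its word vectors (length d)."""
    return word_vectors.mean(axis=0)


def topic_representation(word_vectors, topic_centers):
    """Represent a document as a normalized histogram over k topics.

    Assumption: each word is assigned to the nearest topic center
    (squared Euclidean distance) in the embedding space.
    """
    diffs = word_vectors[:, None, :] - topic_centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)          # shape (n_words, k)
    nearest = sq_dists.argmin(axis=1)            # topic index per word
    counts = np.bincount(nearest, minlength=len(topic_centers))
    return counts / counts.sum()


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data standing in for real embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
embedding_dim, n_topics = 50, 4
topic_centers = rng.normal(size=(n_topics, embedding_dim))
doc_a = rng.normal(size=(30, embedding_dim))     # 30 words
doc_b = rng.normal(size=(45, embedding_dim))     # 45 words

centroid_a = centroid_representation(doc_a)
topics_a = topic_representation(doc_a, topic_centers)
topics_b = topic_representation(doc_b, topic_centers)

print(centroid_a.shape, topics_a.shape)
print(cosine(topics_a, topics_b))
```

Under these assumptions the topic vector has one entry per topic rather than one per embedding dimension, so comparing two documents costs time proportional to the number of topics. This is the compactness versus information-loss tradeoff described above: fewer topics give shorter vectors and faster comparisons, at the price of a coarser description of each document.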
