Exploiting Embedding in Content-Based Recommender Systems

XING is a leading career-oriented social networking site in Europe that recommends job ads to its users. One of the most widely used methods in recommender systems is content-based filtering, which compares descriptions of item characteristics with a user profile that captures the user's preferences. Because its dataset is sparse, i.e. many job postings are rarely interacted with, XING uses a content-based recommender system to improve the quality of its recommendations. Recent word embedding techniques learn semantically meaningful representations of words from their co-occurrence in sentences, which enables effective comparison between words. Based on the Word2Vec technique, XING represents each job posting by the average embedding of the words it contains. This study explores three alternative methods to represent job postings for the task of recommending jobs to users. In the first experiment, we explore whether a subset of the words represents job postings more effectively. In the second experiment, instead of averaging over word embeddings, we directly learn document embeddings using Paragraph2Vec. In the third experiment, we use Word Mover's Distance to estimate the similarity between job postings. Our experiments show that the embeddings learned with Paragraph2Vec give a better estimate of which job postings are similar, but only when high-dimensional settings are used. The Word Mover's Distance algorithm is computationally expensive, so we used existing lower bounds, which allowed us to complete a small-scale experiment within the available time. The results indicate that Word Mover's Distance is less effective than both the average over word embeddings and Paragraph2Vec. In the final part of this thesis, we present Link2Vec, a novel item representation method based on Word2Vec that learns semantic representations for items from the context surrounding the hyperlinks that refer to the item, e.g. hyperlinks to the item's Wikipedia page. Our experiments show that the effectiveness of the embeddings learned with Link2Vec improves with the amount of training data. For the evaluation on the MovieLens dataset, we obtained only a limited set of hyperlinks, which led to results close to those of a baseline that uses the average over word embeddings.
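
To make two of the representations mentioned above concrete, the following is a minimal sketch, not the implementation used in this work. It shows (a) a posting represented as the average of its word embeddings and compared by cosine similarity, and (b) the relaxed lower bound on Word Mover's Distance described by Kusner et al., in which every word sends all of its weight to the nearest word of the other document. The abstract does not say which lower bound was used, so that choice is an assumption. The vocabulary, the 3-dimensional vectors, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy word vectors: hypothetical values, not trained embeddings.
EMBEDDINGS = {
    "python": np.array([1.0, 0.0, 0.0]),
    "java": np.array([0.8, 0.6, 0.0]),
    "developer": np.array([0.0, 1.0, 0.0]),
    "engineer": np.array([0.0, 0.8, 0.6]),
    "nurse": np.array([0.0, 0.0, 1.0]),
}


def known_vectors(tokens):
    """Return a (n_tokens, dim) array for the tokens found in the vocabulary."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("document contains no known tokens")
    return np.array(vectors)


def mean_embedding(tokens):
    """Represent a document as the average of its word vectors."""
    return known_vectors(tokens).mean(axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed lower bound on Word Mover's Distance.

    Each token occurrence carries equal weight (normalized bag of words)
    and moves all of it to the closest word of the other document. The
    bound is the larger of the two one-directional costs.
    """
    vecs_a = known_vectors(tokens_a)
    vecs_b = known_vectors(tokens_b)
    # Pairwise Euclidean distances, shape (len_a, len_b).
    dists = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=2)
    a_to_b = dists.min(axis=1).mean()  # every word of A to its nearest in B
    b_to_a = dists.min(axis=0).mean()  # every word of B to its nearest in A
    return float(max(a_to_b, b_to_a))


if __name__ == "__main__":
    job_a = ["python", "developer"]
    job_b = ["java", "engineer"]
    job_c = ["nurse"]
    print(cosine_similarity(mean_embedding(job_a), mean_embedding(job_b)))
    print(cosine_similarity(mean_embedding(job_a), mean_embedding(job_c)))
    print(relaxed_wmd(job_a, job_b), relaxed_wmd(job_a, job_c))
```

With these toy vectors, both measures rank the second software posting as closer to the first than the nursing posting. The relaxed bound needs only nearest-neighbour lookups rather than a full optimal transport solve, which is why lower bounds of this kind make Word Mover's Distance experiments cheaper to run.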
