Using Topic Models in Content-Based News Recommender Systems

We study content-based recommendation of Finnish news in a system with a very small group of users. We compare three standard methods, Naive Bayes (NB), K-Nearest Neighbor (kNN) Regression, and Regularized Linear Regression, in a novel online simulation setting and in a cold-start simulation. We also apply Latent Dirichlet Allocation (LDA) to the large corpus of news and compare the learned features to those found by Singular Value Decomposition (SVD). Our results indicate that Naive Bayes is the worst of the three models, while K-Nearest Neighbor performs consistently well across input features. Regularized Linear Regression generally performs worse than kNN, but reaches similar performance with some features. With LDA features, Regularized Linear Regression obtains statistically significant improvements over word features, both on the full data set and in the cold-start simulation. In the cold-start simulation we find that LDA gives statistically significant improvements for all three methods.
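
The feature pipeline described above can be approximated with a minimal sketch (not taken from the paper): word counts and tf-idf vectors are reduced to topic/latent features with LDA and SVD, and a regularized (elastic-net) linear regression maps article features to one user's ratings. The corpus, ratings, component counts, and regularization parameters below are illustrative placeholders, and scikit-learn is an assumed implementation choice.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.linear_model import ElasticNet

# Placeholder corpus and per-user feedback; the paper uses a large Finnish news
# corpus and ratings from a very small group of users.
articles = [
    "parliament votes on new budget proposal",
    "local team wins the championship final",
    "central bank keeps interest rates unchanged",
]
ratings = [1.0, 0.0, 1.0]

# Word features: raw counts for LDA, tf-idf weights for SVD (latent semantic indexing).
counts = CountVectorizer().fit_transform(articles)
tfidf = TfidfVectorizer().fit_transform(articles)

# Low-dimensional topic/latent features learned from the article texts.
# n_components is tiny here only so the toy corpus suffices.
lda_features = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
svd_features = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Content-based preference model: regularized (elastic-net) linear regression
# from article features to the user's ratings.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(lda_features, ratings)
print(model.predict(lda_features))  # swap in svd_features or raw word features to compare
```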
