Statistical biases in Information Retrieval metrics for recommender systems

There is an increasing consensus in the Recommender Systems community that the dominant error-based evaluation metrics are insufficient, and mostly inadequate, to properly assess the practical effectiveness of recommendations. Seeking to evaluate recommendation rankings (which largely determine the effective accuracy in matching user needs) rather than predicted rating values, researchers have started to apply Information Retrieval metrics to the evaluation of recommender systems. In this paper we analyse the main issues and potential divergences in the application of Information Retrieval methodologies to recommender system evaluation, and provide a systematic characterisation of experimental design alternatives for this adaptation. We lay out an experimental configuration framework upon which we identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases. These biases considerably distort the empirical measurements, hindering the interpretation and comparison of results across experiments. We develop a formal characterisation of the biases, on the basis of which we analyse their causes and main factors, as well as their impact on evaluation metrics under different experimental configurations, illustrating the theoretical findings with empirical evidence. We propose two experimental design approaches that neutralise such biases to a large extent. We report experiments validating our proposed experimental variants, and comparing them to alternative approaches and metrics that have been defined in the literature with similar or related purposes.
