Popularity Bias in False-positive Metrics for Recommender Systems Evaluation

We investigate the impact of popularity bias on false-positive metrics in the offline evaluation of recommender systems. Unlike their true-positive complements, false-positive metrics reward systems that minimize recommendations disliked by users. Our analysis is, to the best of our knowledge, the first to show that false-positive metrics tend to penalize popular items, the opposite of the behavior of true-positive metrics, so the two types of metrics tend to disagree in the presence of popularity bias. We present a theoretical analysis that identifies why the metrics disagree and determines the rare situations in which they might agree. The key lies in the relationship between the popularity and relevance distributions, in terms of their agreement and steepness, two fundamental concepts that we formalize. We then examine three well-known datasets using multiple popular true- and false-positive metrics on 16 recommendation algorithms. The datasets are chosen specifically to allow us to estimate both biased and unbiased metric values. The results of the empirical study confirm and illustrate our analytical findings. Having established the conditions under which the two types of metrics disagree, we then determine the circumstances under which researchers should use true-positive or false-positive metrics in the offline evaluation of recommender systems.
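The disagreement described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy judgments, the function names, and the choice of precision@k and the fraction of disliked items in the top k (called anti-precision here) as stand-ins for a true-positive and a false-positive metric. It is not the paper's formalization or experimental setup.

```python
# Toy held-out judgments: user -> {item: True if liked, False if disliked}.
# Items "a" and "b" are popular (judged by every user); "c" and "d" are niche.
# All values are invented for illustration.
JUDGMENTS = {
    "u1": {"a": True, "b": False},
    "u2": {"a": True, "b": True},
    "u3": {"a": False, "b": True, "c": True},
    "u4": {"a": True, "b": False, "d": True},
}


def precision_at_k(ranking, judged, k):
    """True-positive metric: fraction of the top k known to be liked."""
    return sum(1 for item in ranking[:k] if judged.get(item) is True) / k


def anti_precision_at_k(ranking, judged, k):
    """False-positive metric: fraction of the top k known to be disliked.
    Lower is better. Unjudged items count as neither liked nor disliked."""
    return sum(1 for item in ranking[:k] if judged.get(item) is False) / k


def mean_metric(metric, ranking, k):
    """Average a per-user metric over all users in the toy data."""
    values = [metric(ranking, judged, k) for judged in JUDGMENTS.values()]
    return sum(values) / len(values)


popular_ranking = ["a", "b"]  # non-personalized, most-judged items first
niche_ranking = ["c", "d"]    # non-personalized, rarely judged items first

pop_precision = mean_metric(precision_at_k, popular_ranking, 2)
pop_anti = mean_metric(anti_precision_at_k, popular_ranking, 2)
niche_precision = mean_metric(precision_at_k, niche_ranking, 2)
niche_anti = mean_metric(anti_precision_at_k, niche_ranking, 2)

print(f"popular: precision={pop_precision}, anti-precision={pop_anti}")
print(f"niche:   precision={niche_precision}, anti-precision={niche_anti}")
```

In this toy data the popular ranking scores higher on precision, because popular items carry more observed likes, but it also scores worse on anti-precision, because the same items carry more observed dislikes. The niche ranking is favored by the false-positive metric largely because most of its items are unjudged.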
