Standing in Your Shoes: External Assessments for Personalized Recommender Systems

The evaluation of recommender systems relies on user preference data, which is difficult to acquire directly because of its subjective nature. Current recommender systems widely use users' historical interactions as implicit or explicit feedback, but such data usually suffers from various types of bias. Little work has been done on collecting and understanding users' personal preferences via third-party annotations. External assessments, that is, annotations made by assessors who are not the systems' users, have been widely adopted in information search scenarios. Is it possible to use external assessments to construct user preference labels? This paper presents the first attempt to incorporate external assessments into preference labeling and recommendation evaluation, aiming to verify the feasibility and reliability of external assessments for personalized recommender systems. We collect both users' real preferences and assessors' estimated preferences through a multi-role, multi-session user study. By investigating inter-assessor agreement and user-assessor consistency, we show that external preference assessments are reasonably stable and highly accurate. Furthermore, we investigate the use of external assessments in system evaluation: evaluation based on external assessments shows a higher degree of consistency with users' online feedback than conventional evaluation based on historical interactions. Our findings indicate that external assessments can be used to construct user preference labels and to evaluate systems in personalized recommendation scenarios.
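The abstract does not specify how inter-assessor agreement, user-assessor consistency, or evaluation consistency are measured. The sketch below illustrates one way such an analysis could be set up, assuming discrete (e.g., 5-point) preference labels and using common measures: average pairwise Cohen's kappa for agreement among assessors, accuracy and Pearson correlation between assessors' estimates and the user's self-reported preferences, and Kendall's tau between system rankings produced by two evaluation protocols. All function names and metric choices here are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of the consistency analyses described in the abstract.
# Metric choices (Cohen's kappa, Pearson r, Kendall's tau) are assumptions
# for illustration; the paper's exact measures are not given here.
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau, pearsonr
from sklearn.metrics import cohen_kappa_score


def inter_assessor_agreement(assessor_labels: np.ndarray) -> float:
    """Average pairwise Cohen's kappa over assessors.

    assessor_labels: shape (n_assessors, n_items), discrete preference labels.
    """
    kappas = [
        cohen_kappa_score(assessor_labels[a], assessor_labels[b])
        for a, b in combinations(range(assessor_labels.shape[0]), 2)
    ]
    return float(np.mean(kappas))


def user_assessor_consistency(user_labels: np.ndarray,
                              assessor_labels: np.ndarray) -> tuple[float, float]:
    """Compare a user's self-reported preferences with the assessors' estimates.

    Returns (accuracy of the rounded average assessor label, Pearson r).
    """
    estimated = assessor_labels.mean(axis=0)          # average estimate per item
    accuracy = float(np.mean(np.round(estimated) == user_labels))
    r, _ = pearsonr(estimated, user_labels)
    return accuracy, float(r)


def evaluation_consistency(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Kendall's tau between system rankings induced by two evaluation protocols,
    e.g., external-assessment-based evaluation vs. users' online feedback."""
    tau, _ = kendalltau(scores_a, scores_b)
    return float(tau)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    user = rng.integers(1, 6, size=50)                # 5-point preference scale
    assessors = np.clip(user + rng.integers(-1, 2, size=(3, 50)), 1, 5)
    print("inter-assessor kappa:", inter_assessor_agreement(assessors))
    print("user-assessor consistency:", user_assessor_consistency(user, assessors))
```

In this setup, a kappa clearly above chance and a high accuracy/correlation would correspond to the "reasonable stability and high accuracy" claim, while a larger Kendall's tau against online feedback for the external-assessment protocol would correspond to the evaluation-consistency claim.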
