When People Change their Mind: Off-Policy Evaluation in Non-stationary Recommendation Environments

We consider the novel problem of evaluating a recommendation policy offline in environments where the reward signal is non-stationary. Non-stationarity appears in many Information Retrieval (IR) applications, such as recommendation and advertising, but its effect on off-policy evaluation has not been studied before; we are the first to address this issue. First, we analyze standard off-policy estimators in non-stationary environments and show, both theoretically and experimentally, that their bias grows with time. Then, we propose new off-policy estimators based on moving averages and show that their bias is independent of time and can be bounded. Furthermore, we provide a method to trade off bias and variance in a principled way, yielding an off-policy estimator that works well in both non-stationary and stationary environments. We experiment on publicly available recommendation datasets and show that our newly proposed moving-average estimators accurately capture changes in non-stationary environments, while standard off-policy estimators fail to do so.
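To make the contrast in the abstract concrete, here is a minimal sketch of a standard inverse propensity scoring (IPS) estimator next to a sliding-window (moving-average) variant. It assumes logged interactions of the form (reward, logging propensity, target-policy probability); the function names, the `window` parameter, and the uniform window are illustrative assumptions, not the paper's exact estimator definitions.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Standard IPS estimate of a target policy's value.

    Averages importance-weighted rewards over the *entire* log.
    In a non-stationary environment this mixes old and new reward
    distributions, so the estimate's bias grows as the log ages.
    """
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

def windowed_ips_estimate(rewards, target_probs, logging_probs, window):
    """Moving-average IPS: only the most recent `window` interactions
    contribute, so bias from reward drift stays bounded by how much
    the environment can change within one window.
    """
    w = target_probs[-window:] / logging_probs[-window:]
    return np.mean(w * rewards[-window:])

# Hypothetical usage with simulated drift: the reward distribution
# shifts halfway through the log, and only the windowed estimator
# tracks the change.
rng = np.random.default_rng(0)
n = 10_000
logging_probs = np.full(n, 0.5)
target_probs = np.full(n, 0.8)
rewards = np.concatenate([rng.binomial(1, 0.9, n // 2),   # old preferences
                          rng.binomial(1, 0.1, n // 2)])  # after the shift
print(ips_estimate(rewards, target_probs, logging_probs))            # mixes both regimes
print(windowed_ips_estimate(rewards, target_probs, logging_probs, 500))  # tracks the recent regime
```

The window size is where the bias-variance trade-off mentioned in the abstract shows up: a short window tracks reward drift (low bias under non-stationarity) but averages few samples (high variance), while a very long window recovers the standard full-log estimator.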
