When People Change their Mind: Off-Policy Evaluation in Non-stationary Recommendation Environments

We consider the novel problem of evaluating a recommendation policy offline in environments where the reward signal is non-stationary. Non-stationarity appears in many Information Retrieval (IR) applications, such as recommendation and advertising, but its effect on off-policy evaluation has not been studied before; we are the first to address this issue. First, we analyze standard off-policy estimators in non-stationary environments and show, both theoretically and experimentally, that their bias grows with time. Then, we propose new off-policy estimators based on moving averages and show that their bias is independent of time and can be bounded. Furthermore, we provide a method to trade off bias and variance in a principled way, yielding an off-policy estimator that works well in both non-stationary and stationary environments. We experiment on publicly available recommendation datasets and show that our newly proposed moving-average estimators accurately capture changes in non-stationary environments, while standard off-policy estimators fail to do so.
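To make the contrast in the abstract concrete, here is a minimal sketch of a standard inverse propensity scoring (IPS) estimator next to a sliding-window (moving-average) variant. It assumes logged interactions of the form (reward, logging propensity, target-policy probability); the function names, the `window` parameter, and the uniform window are illustrative assumptions, not the paper's exact estimator definitions.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Standard IPS estimate of a target policy's value.

    Averages importance-weighted rewards over the *entire* log.
    In a non-stationary environment this mixes old and new reward
    distributions, so the estimate's bias grows as the log ages.
    """
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

def windowed_ips_estimate(rewards, target_probs, logging_probs, window):
    """Moving-average IPS: only the most recent `window` interactions
    contribute, so bias from reward drift stays bounded by how much
    the environment can change within one window.
    """
    w = target_probs[-window:] / logging_probs[-window:]
    return np.mean(w * rewards[-window:])

# Hypothetical usage with simulated drift: the reward distribution
# shifts halfway through the log, and only the windowed estimator
# tracks the change.
rng = np.random.default_rng(0)
n = 10_000
logging_probs = np.full(n, 0.5)
target_probs = np.full(n, 0.8)
rewards = np.concatenate([rng.binomial(1, 0.9, n // 2),   # old preferences
                          rng.binomial(1, 0.1, n // 2)])  # after the shift
print(ips_estimate(rewards, target_probs, logging_probs))            # mixes both regimes
print(windowed_ips_estimate(rewards, target_probs, logging_probs, 500))  # tracks the recent regime
```

The window size is where the bias-variance trade-off mentioned in the abstract shows up: a short window tracks reward drift (low bias under non-stationarity) but averages few samples (high variance), while a very long window recovers the standard full-log estimator.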
