Towards Learning Reward Functions from User Interactions

In the physical world, people have dynamic preferences: the same situation can lead to satisfaction for some people and to frustration for others, so personalization is called for. The same observation holds for online behavior with interactive systems. It is natural to represent the behavior of users engaging with interactive systems, such as a search engine or a recommender system, as a sequence of actions in which each next action depends on the current situation and on the reward the user obtains from taking a particular action. By and large, current online evaluation metrics for interactive systems such as search engines and recommender systems are static and do not reflect differences in user behavior; they rarely capture or model the reward experienced by a user while interacting with the system. We argue that knowing a user's reward function is essential for an interactive system, both for learning and for evaluation. We propose to learn users' reward functions directly from observed interaction traces. In particular, we show how users' reward functions can be uncovered using inverse reinforcement learning techniques, and how user features can be incorporated into the learning process. Our main contribution is a novel, dynamic approach to recovering a user's reward function. We present an analytic approach to this problem and complement it with initial experiments on the interaction logs of a cultural heritage institution; these experiments demonstrate the feasibility of the approach by uncovering different reward functions for different user groups.
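
To make the idea concrete, the sketch below shows a standard finite-horizon maximum-entropy inverse reinforcement learning loop applied to a small tabular setting built from interaction traces. It is a minimal illustration under assumed inputs, not the paper's actual implementation: the state space, transition model P, and feature matrix are hypothetical placeholders, and the reward is assumed linear in the features.

```python
# Minimal sketch of maximum-entropy IRL over logged interaction traces.
# Assumptions (not from the paper): tabular states, known transitions P,
# linear reward r(s) = features[s] @ theta, equal-length trajectories.
import numpy as np


def maxent_irl(P, features, trajectories, horizon, lr=0.1, iters=200):
    """Estimate reward weights theta with reward(s) = features[s] @ theta.

    P            : (A, S, S) array, P[a, s, s'] = transition probability.
    features     : (S, F) state feature matrix (could encode user-group features).
    trajectories : list of equal-length state-index sequences from the log.
    """
    A, S, _ = P.shape
    theta = np.zeros(features.shape[1])

    # Empirical feature expectations and initial-state distribution from the traces.
    emp = np.mean([features[np.asarray(t)].sum(axis=0) for t in trajectories], axis=0)
    p0 = np.bincount([t[0] for t in trajectories], minlength=S) / len(trajectories)

    for _ in range(iters):
        r = features @ theta

        # Backward pass: soft value iteration yields a stochastic (max-ent) policy.
        V = np.zeros(S)
        for _ in range(horizon):
            Q = r[None, :] + P @ V                      # Q[a, s]
            Qmax = Q.max(axis=0)
            V = Qmax + np.log(np.exp(Q - Qmax).sum(axis=0))
        policy = np.exp(Q - V[None, :])                 # pi(a | s)

        # Forward pass: expected state-visitation frequencies under that policy.
        D = p0.copy()
        svf = D.copy()
        for _ in range(horizon - 1):
            D = np.einsum('s,as,ast->t', D, policy, P)
            svf += D

        # Gradient ascent on the log-likelihood: match empirical and expected
        # feature counts.
        theta += lr * (emp - features.T @ svf)

    return theta
```

In a real interactive-system setting, states would correspond to interaction contexts derived from the logs, and the feature matrix could be extended with user-group indicators so that the same procedure yields different reward weights for different user groups.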
