Safe Exploration for Optimizing Contextual Bandits

Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, and recommendation. However, existing learning methods for contextual bandit problems suffer from one of two drawbacks: they either do not explore the space of all possible document rankings (i.e., actions) and thus may miss the optimal ranking, or they present suboptimal rankings to users and thus may harm the user experience. We introduce a new learning method for contextual bandit problems, the Safe Exploration Algorithm (SEA), which overcomes both drawbacks. SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and is therefore safe to execute, but whose performance is suboptimal and needs to be improved. SEA then uses counterfactual learning to learn a new policy from the behavior of the baseline policy, and high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the newly learned policy performs at least as well as the baseline policy, SEA starts using it to execute new actions, allowing it to actively explore favorable regions of the action space. In this way, SEA never performs worse than the baseline policy, and thus does not harm the user experience, while still exploring the action space and thereby being able to find an optimal policy. Our experiments on text classification and document retrieval confirm these properties by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.
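The abstract outlines SEA's control loop: log interactions while the baseline policy acts, train a new policy offline via counterfactual learning (inverse propensity scoring), and let the new policy act only once a high-confidence off-policy estimate of its performance reaches the baseline's. Below is a minimal Python sketch of that decision rule; the policy interface, log format, and `choose_action` helper are hypothetical, and a crude normal-approximation lower bound stands in for the paper's high-confidence estimator.

```python
import numpy as np

class SoftmaxPolicy:
    """Toy stochastic policy over a small discrete action set (illustrative only)."""
    def __init__(self, weights):
        self.weights = weights  # shape: (n_actions, n_features)

    def probs(self, context):
        scores = self.weights @ context
        e = np.exp(scores - scores.max())
        return e / e.sum()

    def prob(self, context, action):
        return self.probs(context)[action]

    def sample(self, context, rng):
        return rng.choice(len(self.weights), p=self.probs(context))


def ips_estimates(log, new_policy):
    """Per-interaction inverse-propensity-scored reward estimates for new_policy."""
    return np.array([
        r["reward"] * new_policy.prob(r["context"], r["action"]) / r["propensity"]
        for r in log
    ])


def lower_confidence_bound(samples):
    """Crude normal-approximation lower bound; the paper's estimator is tighter."""
    n = len(samples)
    return samples.mean() - 1.96 * samples.std(ddof=1) / np.sqrt(n)


def choose_action(context, baseline, new_policy, log, baseline_value, rng):
    """Act with the baseline until the new policy is estimated to be at least as good."""
    if len(log) > 1 and lower_confidence_bound(ips_estimates(log, new_policy)) >= baseline_value:
        # Safe to deploy: the new policy can now explore regions the baseline ignores.
        return new_policy.sample(context, rng)
    return baseline.sample(context, rng)
```

In this sketch, `baseline_value` would be the baseline policy's observed average reward, and `log` holds `(context, action, propensity, reward)` records collected while the baseline was acting; once the lower bound clears the baseline, action selection switches to the newly learned policy.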
