论文信息 - Conservative Contextual Linear Bandits

Conservative Contextual Linear Bandits

Safety is a desirable property that can immensely increase the applicability of learning algorithms in real-world decision-making problems. It is much easier for a company to deploy an algorithm that is safe, i.e., guaranteed to perform at least as well as a baseline. In this paper, we study the issue of safety in contextual linear bandits that have application in many different fields including personalized ad recommendation in online marketing. We formulate a notion of safety for this class of algorithms. We develop a safe contextual linear bandit algorithm, called conservative linear UCB (CLUCB), that simultaneously minimizes its regret and satisfies the safety constraint, i.e., maintains its performance above a fixed percentage of the performance of a baseline strategy, uniformly over time. We prove an upper-bound on the regret of CLUCB and show that it can be decomposed into two terms: 1) an upper-bound for the regret of the standard linear UCB algorithm that grows with the time horizon and 2) a constant term that accounts for the loss of being conservative in order to satisfy the safety constraint. We empirically show that our algorithm is safe and validate our theoretical analysis.

[1] Csaba Szepesvári,et al. Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[2] John N. Tsitsiklis,et al. Linearly Parameterized Bandits , 2008, Math. Oper. Res..

[3] Wei Chu,et al. Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.

[4] Thorsten Joachims,et al. Batch learning from logged bandit feedback through counterfactual risk minimization , 2015, J. Mach. Learn. Res..

[5] Joaquin Quiñonero Candela,et al. Counterfactual reasoning and learning systems: the example of computational advertising , 2013, J. Mach. Learn. Res..

[6] Nan Jiang,et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning , 2015, ICML.

[7] Marek Petrik,et al. Safe Policy Improvement by Minimizing Robust Baseline Regret , 2016, NIPS.

[8] Thorsten Joachims,et al. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback , 2015, ICML.

[9] Philip S. Thomas,et al. High-Confidence Off-Policy Evaluation , 2015, AAAI.

[10] Benjamin Van Roy,et al. Learning to Optimize via Posterior Sampling , 2013, Math. Oper. Res..

[11] Yifan Wu,et al. Conservative Bandits , 2016, ICML.

[12] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[13] Philip S. Thomas,et al. High Confidence Policy Improvement , 2015, ICML.

[14] Thomas P. Hayes,et al. Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.