Conservative Bandits

We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies to maximize revenue while keeping its revenue above a fixed baseline, uniformly over time. Previous work addressed the problem under the weaker requirement that the revenue constraint hold only at a single fixed time in the future, and the design of those algorithms makes them unsuitable under the more stringent uniform-in-time constraint. We consider both the stochastic and the adversarial settings, propose natural yet novel strategies, and analyze the price of maintaining the constraint. Among other things, we prove regret bounds that hold both in expectation and with high probability, and we consider enforcing the constraint either in expectation or with high probability. In the adversarial setting the price of maintaining the constraint appears to be higher, at least for the algorithm we consider. A lower bound shows that our algorithm for the stochastic setting is nearly optimal. Empirical results in synthetic environments complement our theoretical findings.
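To make the setting concrete, the following is a minimal illustrative sketch of one natural way to enforce such a uniform-in-time revenue constraint: play a UCB candidate only when a pessimistic estimate of the resulting cumulative reward stays above a (1 - alpha) fraction of the baseline revenue, and otherwise fall back to the known baseline arm. The strategy, function name, and parameters below are assumptions for illustration, not the algorithm analyzed in the paper.

```python
import numpy as np


def conservative_ucb(means, baseline_mean, alpha=0.1, horizon=10_000, seed=0):
    """Illustrative conservative bandit loop (assumed strategy, not the paper's algorithm).

    Arms pay Bernoulli rewards with the given means; arm 0 is treated as the
    known baseline arm with mean `baseline_mean`. Each round picks the UCB
    candidate, but plays it only if a lower-confidence estimate of the
    cumulative reward stays above (1 - alpha) * t * baseline_mean.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    cum_reward = 0.0

    for t in range(1, horizon + 1):
        # Optimistic index per arm; unplayed arms get an infinite index,
        # and the baseline arm uses its known mean.
        with np.errstate(divide="ignore", invalid="ignore"):
            emp = np.where(counts > 0, sums / counts, np.inf)
            bonus = np.where(counts > 0, np.sqrt(2 * np.log(t) / counts), np.inf)
        ucb = emp + bonus
        ucb[0] = baseline_mean
        candidate = int(np.argmax(ucb))

        # Pessimistic (lower-confidence) value of the candidate arm.
        if candidate == 0:
            lcb = baseline_mean
        elif counts[candidate] == 0:
            lcb = 0.0
        else:
            lcb = max(0.0, emp[candidate] - bonus[candidate])

        # Play the candidate only if even the pessimistic outcome keeps the
        # cumulative revenue above the constraint; otherwise play the baseline.
        safe = cum_reward + lcb >= (1 - alpha) * t * baseline_mean
        arm = candidate if safe else 0

        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        cum_reward += reward

    return cum_reward, counts
```

A possible usage, where arm 0 is the baseline with known mean 0.5: `total, pulls = conservative_ucb([0.5, 0.4, 0.7], baseline_mean=0.5)`. The key design point the abstract alludes to is the fallback rule: exploration is throttled exactly when it would risk dropping the cumulative revenue below the baseline, which is what distinguishes this setting from a constraint checked only at a single future time.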
