Generalized Linear Bandits with Safety Constraints

The classical multi-armed bandit is a class of sequential decision-making problems in which each selected action incurs a cost drawn independently from an unknown underlying distribution. Bandit algorithms have many applications in safety-critical systems, where several constraints must be respected throughout the run of the algorithm despite uncertainty about the problem parameters. This paper formulates a generalized linear stochastic multi-armed bandit problem with generalized linear safety constraints that depend on an unknown parameter vector. For this setting, we propose a Safe UCB-GLM algorithm and provide both general and problem-dependent regret bounds.
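The abstract does not spell out the algorithm, but the setting it describes (an expected reward mu(x^T theta*) under a link function mu, plus a generalized linear constraint on a second unknown parameter that must stay below a threshold) admits a simple illustrative sketch. The following is a minimal sketch, not the authors' Safe UCB-GLM: it assumes a logistic link, a known safety threshold tau, a fixed confidence radius beta, a pessimistic safety check, and a ridge least-squares surrogate in place of the exact GLM maximum-likelihood step. The class name and all parameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic link function mu(z) = 1 / (1 + e^-z) (assumed, not from the paper)."""
    return 1.0 / (1.0 + np.exp(-z))

class SafeUCBGLM:
    """Illustrative safe GLM bandit: optimistic on the reward,
    pessimistic on the unknown generalized linear constraint."""

    def __init__(self, arms, tau, beta=1.0, lam=1.0):
        self.arms = np.asarray(arms)   # (K, d) arm feature vectors
        self.tau = tau                 # safety threshold on the constraint
        self.beta = beta               # confidence-ellipsoid radius (assumed fixed)
        self.lam = lam                 # ridge regularization strength
        d = self.arms.shape[1]
        self.V = lam * np.eye(d)       # regularized design matrix
        self.X, self.r, self.c = [], [], []   # played arms, rewards, costs

    def _estimate(self, y):
        # Ridge least-squares surrogate for the GLM maximum-likelihood step.
        d = self.arms.shape[1]
        if not self.X:
            return np.zeros(d)
        X = np.asarray(self.X)
        return np.linalg.solve(X.T @ X + self.lam * np.eye(d),
                               X.T @ np.asarray(y))

    def select(self):
        """Return the index of the chosen arm, or None if no arm is
        provably safe under the pessimistic constraint estimate."""
        th_r = self._estimate(self.r)   # reward parameter estimate
        th_c = self._estimate(self.c)   # constraint parameter estimate
        V_inv = np.linalg.inv(self.V)
        best, best_ucb = None, -np.inf
        for k, x in enumerate(self.arms):
            w = self.beta * np.sqrt(x @ V_inv @ x)   # confidence width
            if sigmoid(x @ th_c + w) > self.tau:     # worst-case constraint value
                continue                             # skip arms not provably safe
            ucb = sigmoid(x @ th_r) + w              # optimistic reward index
            if ucb > best_ucb:
                best, best_ucb = k, ucb
        return best

    def update(self, k, reward, cost):
        """Record the observed reward and constraint cost for arm k."""
        x = self.arms[k]
        self.V += np.outer(x, x)
        self.X.append(x)
        self.r.append(reward)
        self.c.append(cost)
```

A round of interaction would look like the following; note that safe-bandit algorithms typically assume access to a known safe action to bootstrap exploration, whereas this sketch simply returns None when no arm is provably safe yet.

```python
rng = np.random.default_rng(0)
arms = rng.normal(size=(20, 5))
bandit = SafeUCBGLM(arms, tau=0.6)
k = bandit.select()
if k is not None:
    bandit.update(k, reward=1.0, cost=0.3)  # feedback from the environment
```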
