Stochastic Linear Bandits with Protected Subspace

We study a variant of the stochastic linear bandit problem in which the learner optimizes a linear objective, but rewards accrue only in the directions orthogonal to an unknown subspace (which we interpret as a \textit{protected space}), and the learner has only zeroth-order stochastic oracle access to both the objective and the protected subspace. In particular, at each round, the learner must choose an action and decide whether to query the objective or the protected subspace. Our algorithm, derived from the OFUL principle, uses some of the queries to estimate the protected space and, in almost all rounds, plays optimistically with respect to a confidence set for this space. We prove a $\tilde{O}(sd\sqrt{T})$ regret upper bound when the action space is the complete unit ball in $\mathbb{R}^d$, where $s < d$ is the dimension of the protected subspace and $T$ is the time horizon. Moreover, we demonstrate that a discrete action space can lead to linear regret for an optimistic algorithm, reinforcing the suboptimality of optimism in certain settings. We also show that the protection constraints imply that, in certain settings, no consistent algorithm can achieve regret smaller than $\Omega(T^{3/4})$. Finally, we empirically validate our results on synthetic and real datasets.
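As background for the optimistic play the abstract refers to, the following is a minimal sketch of a standard OFUL-style step for a linear bandit, restricted to a finite set of unit-norm actions for simplicity. It is not the authors' protected-subspace algorithm; the ridge regularizer `lam`, the fixed confidence width `beta`, and the toy two-dimensional instance are illustrative assumptions.

```python
import numpy as np

def oful_step(actions, A, b, beta):
    """One OFUL round: pick the action maximizing the optimistic value
    <a, theta_hat> + beta * ||a||_{A^{-1}} over a finite action set."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # ridge least-squares estimate
    # a^T A^{-1} a for every row a of `actions`
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, A_inv, actions))
    ucb = actions @ theta_hat + beta * widths
    return int(np.argmax(ucb))

# Toy run: d = 2, actions on the unit circle, unknown parameter theta_star.
rng = np.random.default_rng(0)
d, T, beta, lam = 2, 200, 1.0, 1.0
theta_star = np.array([1.0, 0.0])
angles = np.linspace(0, 2 * np.pi, 32, endpoint=False)
actions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

A, b = lam * np.eye(d), np.zeros(d)
for t in range(T):
    a = actions[oful_step(actions, A, b, beta)]
    r = a @ theta_star + 0.1 * rng.standard_normal()  # noisy linear reward
    A += np.outer(a, a)  # update the design matrix
    b += r * a
```

After the loop, the regression estimate `np.linalg.inv(A) @ b` should be close to `theta_star`, and the optimistic choice concentrates on actions near the optimal direction.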
