A Convex Relaxation Approach to Bayesian Regret Minimization in Offline Bandits

Algorithms for offline bandits must optimize decisions in uncertain environments using only offline data. A compelling and increasingly popular objective in offline bandits is to learn a policy that achieves low Bayesian regret with high confidence. An appealing approach to this problem, inspired by recent offline reinforcement learning results, is to maximize a form of lower confidence bound (LCB). This paper proposes a new approach that directly minimizes upper bounds on Bayesian regret using efficient conic optimization solvers. Our bounds build on connections among Bayesian regret, Value-at-Risk (VaR), and chance-constrained optimization. Compared to prior work, our algorithm attains superior theoretical offline regret bounds and better results in numerical simulations. Finally, we provide some evidence that popular LCB-style algorithms may be unsuitable for minimizing Bayesian regret in offline bandits.
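
To make the connection the abstract alludes to concrete, here is a minimal sketch in our own notation; the specific bound below is an illustrative assumption, not necessarily the bound derived in the paper. Let regret(π, θ) ≥ 0 be the (bounded) regret of policy π when the bandit parameters θ are drawn from the posterior, and define Value-at-Risk at level 1 − δ as the optimal value of a chance-constrained program. Then, for any δ ∈ (0, 1),

$$
\mathbb{E}_{\theta}\bigl[\mathrm{regret}(\pi,\theta)\bigr]
\;\le\;
\mathrm{VaR}_{1-\delta}\bigl[\mathrm{regret}(\pi,\theta)\bigr]
\;+\;
\delta \cdot \sup_{\theta}\mathrm{regret}(\pi,\theta),
\qquad
\mathrm{VaR}_{1-\delta}[X] \;=\; \min\bigl\{\, t : \Pr[X \le t] \ge 1-\delta \,\bigr\},
$$

which follows by splitting the expectation on the event {regret ≤ VaR} and using Pr[regret > VaR] ≤ δ. Under this reading, minimizing a tractable (e.g., conic) upper bound on the VaR term controls the Bayesian regret up to a δ-weighted worst-case term.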
