A Convex Relaxation Approach to Bayesian Regret Minimization in Offline Bandits

Algorithms for offline bandits must optimize decisions in uncertain environments using only offline data. A compelling and increasingly popular objective in offline bandits is to learn a policy that achieves low Bayesian regret with high confidence. An appealing approach to this problem, inspired by recent offline reinforcement learning results, is to maximize a form of lower confidence bound (LCB). This paper proposes a new approach that directly minimizes upper bounds on Bayesian regret using efficient conic optimization solvers. Our bounds build on connections among Bayesian regret, Value-at-Risk (VaR), and chance-constrained optimization. Compared to prior work, our algorithm attains superior theoretical offline regret bounds and better results in numerical simulations. Finally, we provide some evidence that popular LCB-style algorithms may be unsuitable for minimizing Bayesian regret in offline bandits.
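
To make the connection the abstract alludes to concrete, here is a minimal sketch in our own notation; the specific bound below is an illustrative assumption, not necessarily the bound derived in the paper. Let regret(π, θ) ≥ 0 be the (bounded) regret of policy π when the bandit parameters θ are drawn from the posterior, and define Value-at-Risk at level 1 − δ as the optimal value of a chance-constrained program. Then, for any δ ∈ (0, 1),

$$
\mathbb{E}_{\theta}\bigl[\mathrm{regret}(\pi,\theta)\bigr]
\;\le\;
\mathrm{VaR}_{1-\delta}\bigl[\mathrm{regret}(\pi,\theta)\bigr]
\;+\;
\delta \cdot \sup_{\theta}\mathrm{regret}(\pi,\theta),
\qquad
\mathrm{VaR}_{1-\delta}[X] \;=\; \min\bigl\{\, t : \Pr[X \le t] \ge 1-\delta \,\bigr\},
$$

which follows by splitting the expectation on the event {regret ≤ VaR} and using Pr[regret > VaR] ≤ δ. Under this reading, minimizing a tractable (e.g., conic) upper bound on the VaR term controls the Bayesian regret up to a δ-weighted worst-case term.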
