On the Optimality of Batch Policy Optimization Algorithms

Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain underdeveloped. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it is optimistic, pessimistic, or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of online stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization.
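To make the notion of a confidence-adjusted index algorithm concrete, the following is a minimal Python sketch for the finite-armed stochastic bandit setting described above. It assumes the index takes the common form of an empirical mean plus a scaled confidence width (a Hoeffding-style width for rewards in [0, 1]); the exact index and width used in the paper may differ. The sign of the scaling parameter beta selects the principle: beta > 0 is optimistic (UCB-style), beta = 0 is neutral (greedy), and beta < 0 is pessimistic (LCB-style).

import numpy as np

def confidence_adjusted_index(rewards_by_arm, beta, delta=0.1):
    # Choose an arm from logged bandit data via a confidence-adjusted index.
    # rewards_by_arm: list of arrays; rewards_by_arm[a] holds the logged rewards of arm a.
    # beta: confidence adjustment (>0 optimistic, =0 neutral, <0 pessimistic).
    # delta: nominal confidence level used to scale the width (assumed form).
    indices = []
    for rewards in rewards_by_arm:
        n = max(len(rewards), 1)
        mean = np.mean(rewards) if len(rewards) > 0 else 0.0
        # Hoeffding-style width for rewards in [0, 1]; this particular width is an assumption.
        width = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
        indices.append(mean + beta * width)
    return int(np.argmax(indices))

# Usage: three arms with unequal coverage in the batch data.
rng = np.random.default_rng(0)
data = [rng.binomial(1, 0.6, size=100),  # well-covered arm
        rng.binomial(1, 0.7, size=5),    # higher-mean but poorly covered arm
        rng.binomial(1, 0.5, size=50)]
print(confidence_adjusted_index(data, beta=+1.0))  # optimistic
print(confidence_adjusted_index(data, beta=0.0))   # neutral / greedy
print(confidence_adjusted_index(data, beta=-1.0))  # pessimistic

With unequal coverage, the pessimistic choice (beta < 0) discounts the poorly covered arm, while the optimistic choice favors it; this is the tension the paper's minimax and weighted-minimax analyses formalize.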
