A Multi-Armed Bandit Approach for Online Expert Selection in Markov Decision Processes

We formulate a multi-armed bandit (MAB) approach to choosing expert policies online in Markov decision processes (MDPs). Given a set of expert policies defined on a common state and action space, the goal is to maximize the agent's cumulative reward by quickly identifying the best expert in the set. The MAB formulation allows us to quantify the performance of an algorithm in terms of the regret incurred by not choosing the best expert from the beginning. We first develop the theoretical framework for MABs in MDPs and present a basic regret decomposition identity. We then adapt the classical Upper Confidence Bound (UCB) algorithm to the problem of choosing experts in MDPs and prove that the expected regret grows at most logarithmically. Lastly, we validate the theory on a small MDP.
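
As a rough illustration of the approach sketched above, the following is a minimal Python sketch of classical UCB1 applied to expert selection: at each episode the agent runs the expert whose upper confidence bound on its empirical mean return is largest, then updates its estimates. The names (experts, run_episode) and the assumption that episode returns are normalized to [0, 1] are illustrative, not taken from the paper.

```python
import math

def ucb_expert_selection(experts, run_episode, num_episodes):
    """UCB1 over a finite set of expert policies.

    experts      -- list of expert policies (the arms)
    run_episode  -- callable mapping an expert to its episode return, assumed in [0, 1]
    num_episodes -- total number of episodes (bandit rounds)
    """
    k = len(experts)
    counts = [0] * k      # number of episodes each expert has been run
    means = [0.0] * k     # empirical mean episode return of each expert

    for t in range(1, num_episodes + 1):
        if t <= k:
            i = t - 1     # run every expert once to initialize its estimate
        else:
            # choose the expert maximizing mean return plus exploration bonus
            i = max(range(k),
                    key=lambda j: means[j] + math.sqrt(2.0 * math.log(t) / counts[j]))

        reward = run_episode(experts[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]   # incremental mean update

    return means, counts
```

Under the standard UCB1 analysis, the expected number of episodes spent on any suboptimal expert grows only logarithmically in the horizon, which is the behavior the regret bound above formalizes for the MDP setting.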
