Expert Selection in High-Dimensional Markov Decision Processes

In this work we present a multi-armed bandit framework for online expert selection in Markov decision processes and demonstrate its use in high-dimensional settings. Our method takes a set of candidate expert policies and switches between them at run-time, using a variant of the classical upper confidence bound algorithm to rapidly identify the best-performing expert, thereby keeping the regret of the overall system low. This is useful in applications where several expert policies are available and one must be selected online for the underlying environment.
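The idea can be sketched concretely: treat each candidate expert as a bandit arm and use the return of an episode as the arm's reward. The snippet below is a minimal illustration, not the authors' implementation; the `experts` callables, the classic Gym-style `env`, the episode counts, and the exploration constant `c` are illustrative assumptions, and a UCB1-style selection rule is used as a stand-in for the paper's UCB variant.

```python
import numpy as np

def ucb_expert_selection(env, experts, num_episodes=500, horizon=200, c=2.0):
    """Select one expert per episode with a UCB1-style rule.

    Assumes a classic Gym-style env (reset() -> obs, step(a) -> obs, r, done, info)
    and that episode returns are roughly bounded so the UCB bonus is meaningful.
    """
    n = len(experts)
    counts = np.zeros(n)        # number of times each expert has been selected
    mean_returns = np.zeros(n)  # running mean of each expert's episode return

    for episode in range(num_episodes):
        if episode < n:
            k = episode  # play each expert once to initialize its estimate
        else:
            ucb = mean_returns + np.sqrt(c * np.log(episode + 1) / counts)
            k = int(np.argmax(ucb))

        # Roll out the chosen expert for one episode and record its return.
        obs = env.reset()
        ep_return = 0.0
        for _ in range(horizon):
            action = experts[k](obs)  # an expert is a policy: observation -> action
            obs, reward, done, _ = env.step(action)
            ep_return += reward
            if done:
                break

        counts[k] += 1
        mean_returns[k] += (ep_return - mean_returns[k]) / counts[k]

    best = int(np.argmax(mean_returns))
    return best, mean_returns
```

As a usage sketch, `experts` could be a list of pretrained policies (e.g. networks trained with different hyperparameters), and the routine above returns the index of the empirically best one along with the estimated returns of all candidates.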
