Regret Bounds for Restless Markov Bandits

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after $T$ steps achieves $\tilde{O}(\sqrt{T})$ regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
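To make the "restless" property concrete, the following is a minimal simulation sketch of the setting described above: every arm's state evolves at every step, whether or not the learner pulls it, and the learner observes only the reward of the pulled arm. The class, the two-state chains, the transition probabilities, and the round-robin learner are all illustrative assumptions; this is not the paper's algorithm.

```python
import random

class RestlessArm:
    """A two-state irreducible Markov chain whose state evolves at every
    time step, independently of the learner's actions (the restless property)."""

    def __init__(self, p_stay, rewards, seed):
        self.p_stay = p_stay    # probability of remaining in the current state
        self.rewards = rewards  # reward associated with each of the two states
        self.state = 0
        self.rng = random.Random(seed)

    def step(self):
        # Flip state with probability 1 - p_stay; for 0 < p_stay < 1
        # the chain is irreducible, as the abstract assumes.
        if self.rng.random() > self.p_stay:
            self.state = 1 - self.state

    def reward(self):
        return self.rewards[self.state]

def run(T=10_000):
    """Average reward of a naive round-robin learner over T steps."""
    arms = [RestlessArm(0.9, (0.0, 1.0), seed=1),
            RestlessArm(0.5, (0.2, 0.8), seed=2)]
    total = 0.0
    for t in range(T):
        chosen = t % len(arms)           # naive learner: alternate arms
        total += arms[chosen].reward()   # reward of the pulled arm only
        for arm in arms:                 # ALL arms evolve, pulled or not
            arm.step()
    return total / T
```

Contrast this with the rested case, where an arm's chain would advance only on the steps it is pulled; the restless dynamics are what make the optimal policy state-dependent rather than an index rule over individual arms.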
