Markov Decision Processes under Bandit Feedback

We consider online learning in finite stochastic Markovian environments where, in each time step, a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, it does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of O(T^{2/3} ln T). In this paper, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of this algorithm (more precisely, of a slightly modified version thereof) is O(T^{1/2} ln T), giving the first rigorously proven, essentially tight regret bound for the problem.
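
To make the feedback model concrete, the following is a minimal Python sketch of the interaction protocol described above: the adversary picks a reward function each step, the learner sees only the reward of the state-action pair it actually visited, and it forms an importance-weighted estimate of the unobserved rewards, in the spirit of exponential-weights bandit methods. This is an illustrative sketch only, not the algorithm analyzed in the paper; the per-state softmax update, the learning rate, and all variable names are assumptions.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 3, 2, 1000
# Known transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

weights = np.zeros((n_states, n_actions))  # per-state preference scores (assumed update rule)
eta = 0.05                                 # learning rate (illustrative value)
state = 0

for t in range(T):
    # Oblivious adversary: a reward function for this step, never shown in full to the learner.
    reward_fn = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    # Exponential-weights (softmax) policy in the current state.
    probs = np.exp(weights[state] - weights[state].max())
    probs /= probs.sum()
    action = rng.choice(n_actions, p=probs)

    # Bandit feedback: only the reward of the visited state-action pair is observed.
    observed_reward = reward_fn[state, action]

    # Importance-weighted estimate of the reward vector in this state
    # (unbiased for the chosen pair, zero elsewhere).
    reward_estimate = np.zeros(n_actions)
    reward_estimate[action] = observed_reward / probs[action]
    weights[state] += eta * reward_estimate

    # The environment transitions according to the known kernel.
    state = rng.choice(n_states, p=P[state, action])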
