Online Markov Decision Processes Under Bandit Feedback

We consider online learning in finite stochastic Markovian environments where, in each time step, a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of O(T^{2/3} ln T). In this paper, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of this algorithm (more precisely, of a slightly modified version thereof) is O(T^{1/2} ln T), giving the first rigorously proven, essentially tight regret bound for the problem.
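To make the performance criterion concrete, the following is a minimal sketch of the regret notion the abstract refers to; the notation (R_T, Pi_stat, x_t, a_t) is illustrative and not taken verbatim from the paper.

% Regret against the best stationary policy in hindsight (illustrative notation).
\[
  \widehat{R}_T \;=\; \max_{\pi \in \Pi_{\mathrm{stat}}}
      \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\bigl(x_t^{\pi}, a_t^{\pi}\bigr)\right]
    \;-\;
      \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\bigl(x_t, a_t\bigr)\right],
\]
where $r_t$ is the reward function chosen by the oblivious adversary for step $t$, $(x_t, a_t)$ is the learner's state-action trajectory, and $(x_t^{\pi}, a_t^{\pi})$ is the trajectory generated by following the fixed stationary policy $\pi$ in the same MDP. Under the uniform-mixing assumption, the bound claimed above reads $\widehat{R}_T = O(T^{1/2} \ln T)$, improving on the earlier $O(T^{2/3} \ln T)$ guarantee.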
