Online Regret Bounds for Markov Decision Processes with Deterministic Transitions

We consider an upper confidence bound algorithm for learning in Markov decision processes with deterministic transitions. For this algorithm we derive upper bounds on the online regret with respect to an (ε-)optimal policy that are logarithmic in the number of steps taken. We also present a corresponding lower bound. As an application, multi-armed bandits with switching cost are considered.
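
The abstract does not spell out the algorithm, but a minimal sketch of the general idea (UCB-style optimistic reward estimates combined with following the cycle of highest optimistic mean reward in the deterministic transition graph) might look as follows. This is an illustration under stated assumptions, not the paper's actual algorithm: the function names (`simple_cycles`, `run_optimistic_cycles`), the confidence bonus, and the toy two-state MDP are all hypothetical choices made for this sketch.

```python
import math
import random


def simple_cycles(succ):
    """Enumerate the simple cycles of a deterministic MDP whose transitions
    are given as {state: {action: next_state}} with integer states
    (brute force, intended only for tiny toy instances)."""
    cycles = []

    def dfs(start, state, visited, edges):
        for action, nxt in succ[state].items():
            if nxt == start:
                cycles.append(edges + [(state, action)])
            elif nxt not in visited and nxt > start:
                dfs(start, nxt, visited | {nxt}, edges + [(state, action)])

    for s in succ:
        dfs(s, s, {s}, [])
    return cycles


def run_optimistic_cycles(succ, true_reward, horizon=2000, seed=0):
    """Repeatedly follow the cycle with the largest optimistic mean reward,
    where each state-action pair gets a UCB-style bonus on its empirical
    mean reward (rewards are assumed to lie in [0, 1])."""
    rng = random.Random(seed)
    counts = {(s, a): 0 for s in succ for a in succ[s]}
    sums = dict.fromkeys(counts, 0.0)
    cycles = simple_cycles(succ)
    total, t = 0.0, 0

    while t < horizon:
        def ucb(edge):
            n = counts[edge]
            if n == 0:
                return 1.0  # optimistic default for unvisited pairs
            bonus = math.sqrt(2.0 * math.log(t + 1) / n)
            return min(1.0, sums[edge] / n + bonus)

        # Cycle with the highest average optimistic reward.
        best = max(cycles, key=lambda c: sum(ucb(e) for e in c) / len(c))

        # Traverse it once; a real algorithm would also have to travel to the
        # cycle first and lengthen episodes, both of which are omitted here.
        for edge in best:
            r = true_reward[edge] + rng.uniform(-0.05, 0.05)  # noisy reward
            counts[edge] += 1
            sums[edge] += r
            total += r
            t += 1
            if t >= horizon:
                break
    return total / horizon


if __name__ == "__main__":
    # Toy two-state MDP: action 0 stays put, action 1 switches state.
    succ = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
    true_reward = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.1}
    print("average reward:", run_optimistic_cycles(succ, true_reward))
```

In this sketch the optimistic estimates shrink toward the empirical means as state-action pairs are visited, so the agent eventually settles on the truly best cycle (here, the self-loop at state 1), which is the qualitative behavior behind a logarithmic regret bound.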
