Markov Decision Processes with Arbitrary Reward Processes

We consider a learning problem where the decision maker interacts with a standard Markov decision process, with the exception that the reward functions vary arbitrarily over time. We show that, against every possible realization of the reward process, the agent can perform, in hindsight, as well as the best stationary policy. This generalizes the classical no-regret result for repeated games. Specifically, we present an efficient online algorithm, in the spirit of reinforcement learning, that ensures that the agent's average performance loss vanishes over time, provided that the environment is oblivious to the agent's actions. Moreover, it is possible to modify the basic algorithm to cope with instances where reward observations are limited to the agent's own trajectory. We present further modifications that reduce the computational cost by using function approximation, and that track the optimal policy when it changes infrequently.
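A minimal formalization of the guarantee described above, given only for illustration; the symbols $T$, $r_t$, $s_t$, $a_t$, $\Pi_{\mathrm{stat}}$, and $\mathcal{A}$ are notation introduced here and are not taken from the paper. Writing $\Pi_{\mathrm{stat}}$ for the set of stationary policies and $\mathcal{A}$ for the learning agent, the vanishing average performance loss can be read as

\[
\mathrm{Regret}_T \;=\; \max_{\pi \in \Pi_{\mathrm{stat}}} \mathbb{E}^{\pi}\!\Big[\textstyle\sum_{t=1}^{T} r_t(s_t,a_t)\Big] \;-\; \mathbb{E}^{\mathcal{A}}\!\Big[\textstyle\sum_{t=1}^{T} r_t(s_t,a_t)\Big],
\qquad
\frac{1}{T}\,\mathrm{Regret}_T \;\longrightarrow\; 0 \quad \text{as } T \to \infty,
\]

where $r_t$ is the (arbitrarily varying) reward function revealed at time $t$, and both expectations are over the state-action trajectories induced in the same underlying MDP by the comparison policy and by the agent, respectively. Under this reading, the guarantee holds for every oblivious realization of the reward sequence, which is what is meant by "against every possible realization of the reward process" above.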
