Paper information - On-line Markov Decision Processes

On-line Markov Decision Processes

We consider an MDP setting in which the reward function is allowed to change at each time step of play (possibly in an adversarial manner), while the dynamics remain fixed. As in the experts setting, we ask how well an agent can do compared to the reward achieved by the best stationary policy over time. We provide efficient algorithms whose regret bounds do not depend on the size of the state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions.
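To make the comparison with the experts setting concrete, here is a minimal sketch of the standard exponential-weights (Hedge) experts algorithm over a fixed set of actions, together with its regret against the best fixed action in hindsight. It illustrates only the stateless experts setting that the abstract refers to, not the paper's MDP algorithm. The names `hedge`, `reward_rounds`, and `eta` are hypothetical and chosen for illustration, and rewards are assumed to lie in [0, 1].

```python
import math
import random


def hedge(reward_rounds, eta):
    """Exponential weights over a fixed action set (illustrative sketch).

    reward_rounds: list of per-round reward vectors, entries assumed in [0, 1].
    eta: learning rate (assumed positive).
    Returns (expected cumulative reward of the learner,
             cumulative reward of the best fixed action in hindsight).
    """
    n_actions = len(reward_rounds[0])
    log_weights = [0.0] * n_actions   # log-domain weights for numerical stability
    totals = [0.0] * n_actions        # cumulative reward of each fixed action
    learner_reward = 0.0

    for rewards in reward_rounds:
        # Turn the log-weights into a probability distribution over actions.
        shift = max(log_weights)
        unnormalized = [math.exp(w - shift) for w in log_weights]
        z = sum(unnormalized)
        probs = [u / z for u in unnormalized]

        # Expected reward of sampling an action from probs in this round.
        learner_reward += sum(p * r for p, r in zip(probs, rewards))

        # The rewards are revealed only after the distribution is fixed.
        for a, r in enumerate(rewards):
            log_weights[a] += eta * r
            totals[a] += r

    return learner_reward, max(totals)


if __name__ == "__main__":
    random.seed(0)
    T, N = 500, 4
    rounds = [[random.random() for _ in range(N)] for _ in range(T)]
    eta = math.sqrt(8 * math.log(N) / T)
    got, best = hedge(rounds, eta)
    print("regret:", best - got)
    print("bound: ", math.sqrt(T * math.log(N) / 2))
```

With the learning rate set to sqrt(8 ln N / T), the standard analysis of this algorithm bounds the regret by sqrt((T / 2) ln N) for any reward sequence. This is the kind of logarithmic dependence on the number of actions that the abstract mentions.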

Y. Mansour | S. Kakade | Eyal Even-Dar
