Value Function Approximation in Zero-Sum Markov Games