Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games

This paper analyzes error propagation in Approximate Dynamic Programming applied to two-player zero-sum Stochastic Games. We provide a novel and unified error propagation analysis in $L_p$-norm of three well-known algorithms adapted to Stochastic Games (namely Approximate Value Iteration, Approximate Policy Iteration and Approximate Generalized Policy Iteration). We show that we can achieve a stationary policy that is $\frac{2\gamma\epsilon + \epsilon'}{(1-\gamma)^2}$-optimal, where $\epsilon$ is the value function approximation error and $\epsilon'$ is the approximate greedy operator error. In addition, we provide a practical algorithm (AGPI-Q) to solve infinite-horizon $\gamma$-discounted two-player zero-sum Stochastic Games in a batch setting. It is an extension of the Fitted-Q algorithm (which solves Markov Decision Processes from data) and can be non-parametric. Finally, we demonstrate experimentally the performance of AGPI-Q on a simultaneous two-player game, namely Alesia.
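Since the abstract does not reproduce the algorithm itself, the following is a minimal sketch of the kind of batch update a Fitted-Q-style method for zero-sum games performs, shown here in its K = 1 (value-iteration) special case of generalized policy iteration: regression targets are built from sampled transitions by solving, at each next state, the zero-sum matrix game induced by the current Q-estimate. The function names, the choice of ExtraTreesRegressor as the non-parametric regressor, and the LP formulation of the matrix-game solve are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.ensemble import ExtraTreesRegressor

def matrix_game_value(M):
    """Value of the zero-sum matrix game M (row player maximizes),
    via the standard LP: max v s.t. M^T x >= v*1, sum(x) = 1, x >= 0."""
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                              # minimize -v  <=>  maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])  # v - (M^T x)_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum(x) = 1
    b_eq = np.ones(1)
    bounds = [(0.0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1]

def fitted_q_game_iteration(samples, q_regressor, actions, gamma):
    """One approximate value-iteration step on batch samples
    (s, a, b, r, s'), where `a` is the maximizer's action and `b` the
    minimizer's. Returns a new regressor fitted to the updated targets."""
    n_act = len(actions)
    X, y = [], []
    for s, a, b, r, s_next in samples:
        # Q-values of all joint actions at s' form a matrix game; its
        # value plays the role the max over actions plays in Fitted-Q.
        feats = np.array([np.concatenate([s_next, [a2, b2]])
                          for a2 in actions for b2 in actions])
        M = q_regressor.predict(feats).reshape(n_act, n_act)
        y.append(r + gamma * matrix_game_value(M))
        X.append(np.concatenate([s, [a, b]]))
    new_q = ExtraTreesRegressor(n_estimators=50)
    new_q.fit(np.array(X), np.array(y))
    return new_q
```

Iterating this update from an initial regressor fitted to the immediate rewards yields a batch, non-parametric approximation of the game's optimal Q-function; the full AGPI-Q scheme additionally performs K approximate evaluation steps per greedy step rather than the single step sketched here.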
