Paper Information - A Generalized Minimax Q-Learning Algorithm for Two-Player Zero-Sum Stochastic Games


We consider the problem of two-player zero-sum games, which is formulated in the literature as a min-max Markov game. The solution of this game starting from a given state, namely the min-max payoff, is called the min-max value of that state. In this work, we compute the solution of the two-player zero-sum game using the technique of successive relaxation, which has been applied in the literature to obtain a faster value iteration algorithm for Markov Decision Processes. We extend successive relaxation to two-player zero-sum games and show that, under a special structure on the game, it facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm using stochastic approximation techniques, and we demonstrate its effectiveness through experiments.
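The abstract does not state the update rule itself, so the following is only an illustrative sketch of what a relaxed minimax Q-learning step might look like. It assumes the relaxation enters as in successive over-relaxation Q-learning: the target blends the usual one-step minimax target with the minimax value of the current state, weighted by a relaxation parameter `w`. The function names, the restriction to two actions per player (so that the matrix-game value has a closed form), and the exact form of the target are assumptions for illustration, not the authors' stated algorithm. The admissible range of `w` depends on the game's structure and is not specified here.

```python
def matrix_game_value(m):
    """Value of a 2x2 zero-sum matrix game in which the row player maximizes.

    Illustrative helper (hypothetical name); larger games would need a
    linear program instead of this closed form.
    """
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))   # best guaranteed payoff for the row player
    minimax = min(max(a, c), max(b, d))   # best guaranteed cap for the column player
    if maximin == minimax:                # a pure-strategy saddle point exists
        return maximin
    # No saddle point: both players mix, and the value has a closed form.
    return (a * d - b * c) / (a + d - b - c)


def relaxed_minimax_q_update(Q, s, u, v, r, s_next, alpha, gamma, w):
    """One relaxed minimax Q-learning step (assumed form, see lead-in).

    Q[s][u][v] is the Q-value of state s under maximizer action u and
    minimizer action v. With w = 1 the step reduces to the standard
    minimax Q-learning update.
    """
    target = (w * (r + gamma * matrix_game_value(Q[s_next]))
              + (1.0 - w) * matrix_game_value(Q[s]))
    Q[s][u][v] += alpha * (target - Q[s][u][v])
    return Q[s][u][v]


if __name__ == "__main__":
    # Single-state toy example with two actions per player.
    Q = [[[0.0, 0.0], [0.0, 0.0]]]
    relaxed_minimax_q_update(Q, s=0, u=0, v=0, r=1.0, s_next=0,
                             alpha=0.5, gamma=0.9, w=1.0)
    print(Q[0])
```

Setting `w = 1` recovers ordinary minimax Q-learning, which makes it easy to compare the relaxed and unrelaxed updates on the same sampled transitions.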

Shalabh Bhatnagar | Raghuram Bharadwaj Diddigi | Chandramouli Kamanchi
