Synergizing reinforcement learning and game theory - A new direction for control

Reinforcement learning (RL) has evolved into a major technique for adaptive optimal control of nonlinear systems. However, the majority of RL algorithms proposed so far impose a strong constraint on the structure of the environment dynamics by assuming that it operates as a Markov decision process (MDP). The MDP framework envisages a single agent operating in a stationary environment, thereby limiting the scope of application of RL to control problems. Recently, a new line of research has proposed Markov games as an alternative system model to enhance the generality and robustness of RL-based approaches. This paper presents this new direction, which seeks to synergize the broad areas of RL and game theory, as an interesting and challenging avenue for designing intelligent and reliable controllers. First, we briefly review some representative RL algorithms for the sake of completeness, and then describe the recent direction that seeks to integrate RL and game theory. Finally, open issues are identified and future research directions are outlined.
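To make the MDP-versus-Markov-game distinction concrete, the following is a minimal sketch of a minimax-Q-style update on a single-state, two-player, zero-sum Markov game. All payoffs and learning parameters here are illustrative, and the stage-game value is computed over pure strategies only, which is a simplification: the full minimax-Q algorithm solves a linear program for the mixed-strategy value of each stage game.

```python
# Minimax-Q-style learning on a single-state zero-sum Markov game.
# Simplification (assumption): the stage-game value is taken over pure
# strategies; full minimax-Q uses the mixed-strategy (LP) value.
import itertools

R = [[3.0, 1.0],   # R[a][o]: reward to the maximizing agent when it
     [4.0, 2.0]]   # plays action a and the opponent plays action o
gamma, alpha = 0.9, 0.2
Q = [[0.0, 0.0], [0.0, 0.0]]

def value(Q):
    # Pure-strategy minimax value of the matrix game Q:
    # the agent maximizes its worst-case (opponent-minimized) payoff.
    return max(min(row) for row in Q)

for _ in range(2000):
    for a, o in itertools.product(range(2), range(2)):
        # Single state, so the successor-state value is value(Q) itself.
        Q[a][o] += alpha * (R[a][o] + gamma * value(Q) - Q[a][o])

# This matrix has a pure saddle point at (a=1, o=1) with stage value 2,
# so the learned game value should approach 2 / (1 - gamma) = 20.
print(round(value(Q), 2))
```

In contrast to single-agent Q-learning, which backs up a plain maximum over the agent's own actions, the backup here is a max-min over both players' actions; this worst-case backup is what gives Markov-game controllers their robustness against an adversarial disturbance or opponent.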
