Efficient Use of Heuristics for Accelerating XCS-Based Policy Learning in Markov Games

In Markov games, playing against non-stationary opponents that are themselves capable of learning remains challenging for reinforcement learning (RL) agents, because the opponents can evolve their policies concurrently. This increases the complexity of the learning task and slows down the learning speed of the RL agents. This paper proposes the efficient use of rough heuristics to speed up policy learning when playing against concurrent learners. Specifically, we propose an algorithm that efficiently learns explainable and generalized action selection rules by combining the representation of quantitative heuristics and an opponent model with an eXtended Classifier System (XCS) in zero-sum Markov games. A neural network models the opponent from its observed behavior, and the inferred opponent policy is used for action selection and rule evolution. When multiple heuristic policies are available, we introduce the concept of Pareto optimality for action selection. In addition, by taking advantage of the condition representation and matching mechanism of XCS, the heuristic policies and the opponent model can provide guidance in situations with similar feature representations. Furthermore, we introduce an accuracy-based eligibility trace mechanism to speed up rule evolution, i.e., classifiers that match the historical trace are reinforced in proportion to their accuracy. We demonstrate the advantages of the proposed algorithm over several benchmark algorithms in a soccer scenario and a thief-and-hunter scenario.
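For illustration, the following is a minimal sketch of an accuracy-weighted eligibility trace update over an XCS-style rule population, in the spirit of the mechanism described above. The classifier fields, parameter names, and matching routine are assumptions made for the example, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str            # ternary condition string, e.g. "01#1#0"
    action: int
    prediction: float = 10.0  # predicted payoff
    accuracy: float = 0.0     # kappa, derived from the prediction error
    trace: float = 0.0        # eligibility trace value

def matches(cl: Classifier, state: str) -> bool:
    """A condition matches when every non-'#' symbol equals the state bit."""
    return all(c in ('#', s) for c, s in zip(cl.condition, state))

def update_with_traces(population, state, action, td_error,
                       learning_rate=0.2, gamma=0.9, trace_decay=0.8):
    """Reinforce classifiers along the historical trace, scaled by accuracy."""
    for cl in population:
        # Classifiers in the current action set get their trace refreshed.
        if matches(cl, state) and cl.action == action:
            cl.trace = 1.0
        # Every traced classifier absorbs a share of the TD error in
        # proportion to how accurate it has been, so reliable rules are
        # reinforced more strongly along the trace.
        if cl.trace > 0.0:
            cl.prediction += learning_rate * cl.accuracy * cl.trace * td_error
            cl.trace *= gamma * trace_decay   # decay the trace toward zero
```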
