Hybrid Online and Offline Reinforcement Learning for Tibetan Jiu Chess

In this study, hybrid state-action-reward-state-action (SARSA) and Q-learning algorithms are applied to different stages of an Upper Confidence Bounds applied to Trees (UCT) search for Tibetan Jiu chess. Q-learning is also used to update all nodes on the search path when each game ends. A learning strategy is proposed that combines SARSA and Q-learning with domain knowledge to construct feedback functions for the layout and battle stages. An improved deep neural network based on ResNet18 is used for self-play training. Experimental results show that hybrid online and offline reinforcement learning with a deep neural network improves the game program's learning efficiency and its understanding of Tibetan Jiu chess.
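The abstract does not give implementation details, but a minimal tabular sketch of the two update rules it names may help make the online/offline distinction concrete. The hyperparameters (`alpha`, `gamma`), the Q-table layout, and the `backup_path` helper below are illustrative assumptions, not the paper's actual code.

```python
from collections import defaultdict

# Sketch only: tabular SARSA (on-policy) and Q-learning (off-policy) updates.
# The state/action encoding, learning rate, and discount are assumed values.

Q = defaultdict(float)      # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.99    # learning rate and discount factor (assumed)

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy update: bootstraps from the action actually taken next,
    so it can run online while the UCT search is in progress."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, legal_actions):
    """Off-policy update: bootstraps from the greedy action in the
    successor state, independent of the action actually chosen."""
    best_next = max((Q[(s_next, a2)] for a2 in legal_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def backup_path(path, final_reward, legal_actions_fn):
    """End-of-game backup over the visited (state, action) pairs, in the
    spirit of updating all nodes on the search path when a game ends."""
    for (s, a), (s_next, _) in zip(path[:-1], path[1:]):
        q_learning_update(s, a, 0.0, s_next, legal_actions_fn(s_next))
    s_last, a_last = path[-1]
    # Terminal transition: no successor state, so the bootstrap term is zero.
    Q[(s_last, a_last)] += alpha * (final_reward - Q[(s_last, a_last)])
```

Under these assumptions, the SARSA rule would serve as the online learner during search, while the Q-learning rule, with its greedy bootstrap, fits the offline backup of the whole search path once the game's outcome is known.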
