Hybrid Online and Offline Reinforcement Learning for Tibetan Jiu Chess

In this study, hybrid state-action-reward-state-action (SARSA) and Q-learning algorithms are applied to different stages of an Upper Confidence Bounds applied to Trees (UCT) search for Tibetan Jiu chess. Q-learning is also used to update all nodes on the search path when each game ends. A learning strategy is proposed that combines SARSA and Q-learning with domain knowledge to construct feedback functions for the layout and battle stages. An improved deep neural network based on ResNet18 is used for self-play training. Experimental results show that hybrid online and offline reinforcement learning with a deep neural network improves the game program's learning efficiency and its understanding of Tibetan Jiu chess.
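The abstract does not give implementation details, but a minimal tabular sketch of the two update rules it names may help make the online/offline distinction concrete. The hyperparameters (`alpha`, `gamma`), the Q-table layout, and the `backup_path` helper below are illustrative assumptions, not the paper's actual code.

```python
from collections import defaultdict

# Sketch only: tabular SARSA (on-policy) and Q-learning (off-policy) updates.
# The state/action encoding, learning rate, and discount are assumed values.

Q = defaultdict(float)      # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.99    # learning rate and discount factor (assumed)

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy update: bootstraps from the action actually taken next,
    so it can run online while the UCT search is in progress."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, legal_actions):
    """Off-policy update: bootstraps from the greedy action in the
    successor state, independent of the action actually chosen."""
    best_next = max((Q[(s_next, a2)] for a2 in legal_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def backup_path(path, final_reward, legal_actions_fn):
    """End-of-game backup over the visited (state, action) pairs, in the
    spirit of updating all nodes on the search path when a game ends."""
    for (s, a), (s_next, _) in zip(path[:-1], path[1:]):
        q_learning_update(s, a, 0.0, s_next, legal_actions_fn(s_next))
    s_last, a_last = path[-1]
    # Terminal transition: no successor state, so the bootstrap term is zero.
    Q[(s_last, a_last)] += alpha * (final_reward - Q[(s_last, a_last)])
```

Under these assumptions, the SARSA rule would serve as the online learner during search, while the Q-learning rule, with its greedy bootstrap, fits the offline backup of the whole search path once the game's outcome is known.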
