Mastering 2048 With Delayed Temporal Coherence Learning, Multistage Weight Promotion, Redundant Encoding, and Carousel Shaping

2048 is an engaging single-player, nondeterministic video puzzle game which, thanks to its simple rules and hard-to-master gameplay, has gained massive popularity in recent years. Since 2048 can be conveniently embedded into the framework of discrete-state Markov decision processes, we treat it as a testbed for evaluating existing and new reinforcement learning methods. With the aim of developing a strong 2048-playing program, we employ temporal difference learning with systematic n-tuple networks. We show that this basic method can be significantly improved with temporal coherence learning, a multistage function approximator with weight promotion, carousel shaping, and redundant encoding. In addition, we demonstrate how to exploit the characteristics of the n-tuple network to improve the efficiency of the learning process by delaying the (decayed) updates and applying lock-free optimistic parallelism, which effortlessly takes advantage of multiple CPU cores. In this way, we were able to develop the best 2048-playing program known to date, which confirms the effectiveness of the introduced methods for discrete-state Markov decision problems.
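
The combination of an n-tuple network value function with temporal coherence (TC) learning can be illustrated with a short sketch. The Python code below is a minimal illustration, not the authors' implementation; the tuple shapes, the board encoding, and the meta learning rate are assumptions made only for the example.

# Minimal sketch of temporal coherence (TC) learning with an n-tuple
# network value function for 2048. Tuple shapes, board encoding, and the
# meta learning rate are illustrative assumptions, not the paper's settings.

BOARD_CELLS = 16   # 4x4 board, flattened row-major
TILE_CODES = 16    # tiles stored as exponents: 0 = empty, 1 = 2, ..., 15 = 32768

# Two example 4-tuples (the first two rows); a strong player would use
# more and larger, systematically selected tuples.
TUPLES = [
    (0, 1, 2, 3),
    (4, 5, 6, 7),
]

class NTupleNetwork:
    """V(board) = sum of lookup-table weights, one table per tuple."""

    def __init__(self, tuples, meta_rate=1.0):
        self.tuples = tuples
        self.meta_rate = meta_rate
        # One weight table per tuple, indexed by the tile codes on its cells.
        self.weights = [[0.0] * (TILE_CODES ** len(t)) for t in tuples]
        # TC accumulators: signed error sum E and absolute error sum A per weight.
        self.E = [[0.0] * (TILE_CODES ** len(t)) for t in tuples]
        self.A = [[0.0] * (TILE_CODES ** len(t)) for t in tuples]

    def index(self, board, tup):
        # Interpret the tile codes on the tuple's cells as a base-16 number.
        idx = 0
        for cell in tup:
            idx = idx * TILE_CODES + board[cell]
        return idx

    def value(self, board):
        return sum(w[self.index(board, t)]
                   for w, t in zip(self.weights, self.tuples))

    def update(self, board, delta):
        # TC rule: each weight gets its own step size |E| / A in [0, 1],
        # which shrinks when successive errors keep changing sign.
        for i, t in enumerate(self.tuples):
            j = self.index(board, t)
            a = self.A[i][j]
            rate = abs(self.E[i][j]) / a if a > 0.0 else 1.0
            self.weights[i][j] += self.meta_rate * rate * delta
            self.E[i][j] += delta
            self.A[i][j] += abs(delta)

In afterstate TD(0), the usual setup for 2048, delta would be the one-step error r + V(s'_next) - V(s') between consecutive afterstates, and update would be applied to the earlier afterstate s'. Symmetric expansion of the tuples over the eight board symmetries, the multistage approximator, and the delayed, decayed updates described in the abstract are omitted from this sketch.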
