Solving the Rubik's Cube Without Human Knowledge

A generally intelligent agent must be able to teach itself how to solve problems in complex domains with minimal human supervision. Recently, deep reinforcement learning algorithms combined with self-play have achieved superhuman proficiency in Go, Chess, and Shogi without human data or domain knowledge. In those games, a reward is always received at the end of the game; in many combinatorial optimization environments, however, rewards are sparse and episodes are not guaranteed to terminate. We introduce Autodidactic Iteration, a novel reinforcement learning algorithm that teaches itself how to solve the Rubik's Cube with no human assistance. Our algorithm solves 100% of randomly scrambled cubes while achieving a median solve length of 30 moves -- less than or equal to solvers that employ human domain knowledge.

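The abstract names Autodidactic Iteration but does not spell out its update rule. The sketch below is one plausible reading of that self-training loop: states are generated by scrambling outward from the solved cube, and each state's value target is bootstrapped from its best one-step successor. Everything concrete here is an assumption for illustration: the toy permutation puzzle standing in for the cube, the names apply_move, net_value, and adi_targets, and the constant reward values are not the authors' code.

```python
# Minimal, hypothetical sketch of Autodidactic-Iteration-style target
# generation. A small permutation puzzle stands in for the Rubik's Cube,
# and a constant function stands in for the learned value network.
import random

N_MOVES = 4                  # stand-in for the cube's larger move set
SOLVED = tuple(range(6))     # stand-in solved state

def apply_move(state, move):
    """Toy dynamics: each move swaps one fixed pair of positions."""
    pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]
    i, j = pairs[move]
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)

def reward(state):
    """Sparse reward: +1 only when the state is solved, -1 otherwise."""
    return 1.0 if state == SOLVED else -1.0

def net_value(state):
    """Placeholder for the learned value estimate (untrained: returns 0)."""
    return 0.0

def adi_targets(scramble_depth=5):
    """Scramble outward from the solved state and, for each visited state,
    set the value target to the best one-step lookahead value and the
    policy target to the move that achieves it."""
    targets = []
    state = SOLVED
    for _ in range(scramble_depth):
        state = apply_move(state, random.randrange(N_MOVES))
        child_values = []
        for move in range(N_MOVES):
            child = apply_move(state, move)
            child_values.append(reward(child) + net_value(child))
        value_target = max(child_values)
        policy_target = child_values.index(value_target)
        targets.append((state, value_target, policy_target))
    return targets

if __name__ == "__main__":
    for s, v, p in adi_targets():
        print(s, v, p)
```

In a full system, a deep neural network would presumably replace net_value and be trained on these targets, with a tree search consulting it at solve time; this skeleton only shows the shape of the self-generated training data.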