An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies

What is a good exploration strategy for an agent that interacts with its environment in the absence of external rewards? Ideally, we would like to learn a policy that drives towards a uniform state-action visitation (highly exploring) in a minimum number of steps (fast mixing), so as to ease the efficient learning of any goal-conditioned policy later on. Unfortunately, it is remarkably arduous to directly learn an optimal policy of this nature. In this paper, we propose a novel surrogate objective for learning highly exploring and fast mixing policies, which focuses on maximizing a lower bound to the entropy of the steady-state distribution induced by the policy. In particular, we introduce three novel lower bounds, which lead to as many optimization problems, trading off theoretical guarantees against computational complexity. Then, we present a model-based reinforcement learning algorithm, IDE$^{3}$AL, to learn an optimal policy according to the introduced objective. Finally, we provide an empirical evaluation of this algorithm on a set of hard-exploration tasks.
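As a minimal illustration of the quantity being optimized (not the authors' IDE$^{3}$AL algorithm itself), the sketch below computes the entropy of the steady-state distribution induced by a stationary policy in a small tabular MDP, assuming the transition model is known and the induced Markov chain is ergodic; all function and variable names are illustrative.

```python
# Minimal sketch: entropy of the steady-state distribution induced by a policy
# in a known tabular MDP (assumes the induced chain is ergodic). This is the
# objective whose lower bounds the paper proposes to maximize; it is NOT the
# IDE^3AL algorithm, and all names here are illustrative.
import numpy as np

def steady_state_entropy(P, pi):
    """P: transition tensor of shape (S, A, S); pi: policy of shape (S, A)."""
    S = P.shape[0]
    # Markov chain induced by the policy: P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a)
    P_pi = np.einsum('sa,sat->st', pi, P)
    # Stationary distribution d solves d = d P_pi with sum(d) = 1.
    A_mat = np.vstack([P_pi.T - np.eye(S), np.ones(S)])
    b = np.append(np.zeros(S), 1.0)
    d, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
    d = np.clip(d, 1e-12, None)
    d /= d.sum()
    # Entropy of the steady-state distribution (maximal under uniform visitation).
    return -np.sum(d * np.log(d))

# Example: 2-state, 2-action MDP with a uniform policy.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
pi = np.full((2, 2), 0.5)
print(steady_state_entropy(P, pi))  # ~0.693 = log(2), i.e., uniform over 2 states
```

A highly exploring policy is one whose induced stationary distribution makes this entropy large; the paper's surrogate objective maximizes tractable lower bounds to it rather than the entropy itself.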
