Learning Intrinsically Motivated Options to Stimulate Policy Exploration

A Reinforcement Learning (RL) agent must find an optimal sequence of actions in order to maximize rewards. Doing so requires consistent exploration of states and action sequences to ensure that the policy it converges to is optimal. One way to motivate exploration is through intrinsic rewards, i.e. rewards the agent generates for itself to steer it towards interesting behaviors. We propose to learn from such intrinsic rewards through exploration options: additional temporally-extended actions that call separate policies (or "Explorer" agents), each associated with an intrinsic reward. We show that this method has several key advantages over the usual approach of adding a weighted sum of intrinsic rewards to the task reward, chiefly task-transfer ability and scalability to multiple reward functions.
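To make the idea concrete, here is a minimal sketch (not the paper's implementation) of the mechanism described above: a tabular task agent whose action set is augmented with one exploration option, which, when selected, hands control for a few steps to a separate "Explorer" policy trained purely on an intrinsic, count-based reward. The chain environment, hyperparameters, bonus formula, and option length are illustrative assumptions; the contrasting baseline would instead fold the bonus into a single reward, e.g. r_ext + beta * r_int, for one agent.

```python
# Sketch only: exploration option calling an Explorer trained on intrinsic reward.
import numpy as np

N_STATES, LEFT, RIGHT = 20, 0, 1
GAMMA, ALPHA, EPS = 0.99, 0.1, 0.1
OPTION_LEN = 5                      # steps executed by the Explorer per option call

def step(s, a):
    """Chain walk: extrinsic reward only at the right end (toy environment)."""
    s_next = min(s + 1, N_STATES - 1) if a == RIGHT else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, r, done

q_task = np.zeros((N_STATES, 3))    # primitive LEFT, RIGHT + one exploration option
q_expl = np.zeros((N_STATES, 2))    # Explorer acts with primitive actions only
counts = np.ones(N_STATES)          # visit counts for the intrinsic bonus

def eps_greedy(q_row):
    return np.random.randint(len(q_row)) if np.random.rand() < EPS else int(np.argmax(q_row))

for episode in range(500):
    s, done = 0, False
    while not done:
        a = eps_greedy(q_task[s])
        if a < 2:                                   # primitive action: standard Q-learning
            s_next, r, done = step(s, a)
            q_task[s, a] += ALPHA * (r + GAMMA * np.max(q_task[s_next]) * (not done) - q_task[s, a])
        else:                                       # exploration option: run the Explorer
            s_next, ret, disc = s, 0.0, 1.0
            for _ in range(OPTION_LEN):
                a_e = eps_greedy(q_expl[s_next])
                s2, r, done = step(s_next, a_e)
                counts[s2] += 1
                r_int = 1.0 / np.sqrt(counts[s2])   # intrinsic, count-based reward (assumed bonus)
                q_expl[s_next, a_e] += ALPHA * (r_int + GAMMA * np.max(q_expl[s2]) - q_expl[s_next, a_e])
                ret += disc * r                     # extrinsic return credited to the task agent
                disc *= GAMMA
                s_next = s2
                if done:
                    break
            # SMDP-style backup for the temporally extended action
            q_task[s, a] += ALPHA * (ret + disc * np.max(q_task[s_next]) * (not done) - q_task[s, a])
        s = s_next
```

Because the Explorer's value function depends only on the intrinsic reward, it can in principle be reused when the extrinsic task changes, and additional Explorers (one per intrinsic reward) can be exposed as further options without retuning a mixing weight.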
