MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning

Exploration in reinforcement learning is a challenging problem: in the worst case, the agent must search for high-reward states that could be hidden anywhere in the state space. Can we define a more tractable class of RL problems, where the agent is provided with examples of successful outcomes? In this problem setting, the reward function can be obtained automatically by training a classifier to categorize states as successful or not. If trained properly, such a classifier can provide a well-shaped objective landscape that both promotes progress toward good states and provides a calibrated exploration bonus. In this work, we show that an uncertainty-aware classifier can solve challenging reinforcement learning problems by both encouraging exploration and providing directed guidance towards positive outcomes. We propose a novel mechanism for obtaining these calibrated, uncertainty-aware classifiers based on an amortized technique for computing the normalized maximum likelihood (NML) distribution, using meta-learning to make this computation tractable. We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions, while also providing more effective guidance towards the goal. We demonstrate that our algorithm solves a number of challenging navigation and robotic manipulation tasks which prove difficult or impossible for prior methods.

Figure 1. MURAL: Our method trains an uncertainty-aware classifier based on user-provided examples of successful outcomes. Appropriate uncertainty in the classifier, obtained via a meta-learning-based estimator for the normalized maximum likelihood (NML) distribution, automatically incentivizes exploration and provides reward shaping for RL. This can solve complex robotic manipulation and navigation tasks, as shown here.
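To make the classifier-as-reward idea concrete: in conditional NML, the query state is appended to the dataset of labeled outcomes under each candidate label, the classifier is refit for each case, and the likelihoods each fit assigns to its own hypothesized label are normalized to give the success probability used as the reward. The sketch below is a minimal illustration of that naive per-query computation with a small logistic-regression classifier in NumPy; it is not the paper's implementation, whose contribution is to amortize the per-query refitting with meta-learning. The helper names (fit_logistic, cnml_success_reward) and the toy data are purely illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

    def fit_logistic(X, y, steps=300, lr=0.5):
        # Fit a tiny logistic-regression "success classifier" by gradient descent.
        Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature
        w = np.zeros(Xb.shape[1])
        for _ in range(steps):
            p = sigmoid(Xb @ w)
            w -= lr * Xb.T @ (p - y) / len(y)
        return w

    def predict_success(w, state):
        return sigmoid(np.append(state, 1.0) @ w)

    def cnml_success_reward(state, X, y):
        # Conditional NML: append the query state with each candidate label,
        # refit the classifier for each case, and normalize the likelihoods
        # that each fit assigns to its own hypothesized label.
        liks = []
        for label in (0.0, 1.0):
            X_aug = np.vstack([X, state[None]])
            y_aug = np.append(y, label)
            w = fit_logistic(X_aug, y_aug)
            p_success = predict_success(w, state)
            liks.append(p_success if label == 1.0 else 1.0 - p_success)
        return liks[1] / (liks[0] + liks[1])   # normalized success probability, used as the reward

    # Toy usage: positives are user-provided success states near the goal (1, 1);
    # negatives are states visited by the current policy near the origin.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([1.0, 1.0], 0.05, (20, 2)),
                   rng.normal([0.0, 0.0], 0.30, (20, 2))])
    y = np.concatenate([np.ones(20), np.zeros(20)])

    for s in ([1.0, 1.0], [0.5, 0.5], [-0.5, -0.5]):
        print(s, "->", round(cnml_success_reward(np.array(s), X, y), 3))

With an expressive, meta-learned classifier, states far from all labeled data receive probabilities near 0.5, which is what produces the exploration bonus; the linear model above is only meant to show the label-augmentation and normalization steps.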
