Behavior Priors for Efficient Reinforcement Learning

As we deploy reinforcement learning agents to solve increasingly challenging problems, methods that allow us to inject prior knowledge about the structure of the world and effective solution strategies become increasingly important. In this work, we consider how information and architectural constraints can be combined with ideas from the probabilistic modeling literature to learn behavior priors that capture the common movement and interaction patterns shared across a set of related tasks or contexts. For example, the day-to-day behavior of humans comprises distinctive locomotion and manipulation patterns that recur across many different situations and goals. We discuss how such behavior patterns can be captured with probabilistic trajectory models and how these models can be integrated effectively into reinforcement learning schemes, e.g., to facilitate multi-task and transfer learning. We then extend these ideas to latent variable models and consider a formulation for learning hierarchical priors that capture different aspects of behavior in reusable modules. We discuss how such latent variable formulations connect to related work on hierarchical reinforcement learning (HRL) and on mutual-information and curiosity-based objectives, thereby offering an alternative perspective on existing ideas. We demonstrate the effectiveness of our framework by applying it to a range of simulated continuous control domains.
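To make the setup concrete, below is a minimal sketch of the kind of KL-regularized objective this line of work builds on; the notation (policy $\pi$, behavior prior $\pi_0$, trade-off coefficient $\alpha$) is a standard choice assumed here for illustration rather than quoted from the paper:

$$
\mathcal{J}(\pi) \;=\; \mathbb{E}_{\pi}\Big[ \sum_{t} \gamma^{t} \big( r(s_t, a_t) \;-\; \alpha \, \mathrm{KL}\big[\, \pi(\cdot \mid s_t) \,\|\, \pi_0(\cdot \mid s_t) \,\big] \big) \Big].
$$

Maximizing this objective trades off task reward against staying close to the prior $\pi_0$, which can itself be learned across related tasks so that it captures recurring behavior; choosing a uniform $\pi_0$ recovers standard entropy-regularized reinforcement learning as a special case.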
