PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning
[1] Doina Precup, et al. The Option Keyboard: Combining Skills in Reinforcement Learning, 2019, NeurIPS.
[2] Sergey Levine, et al. From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following, 2019, ICLR.
[3] Risto Miikkulainen, et al. Evolving explicit opponent models in game playing, 2007, GECCO.
[4] Alexey Dosovitskiy, et al. End-to-End Driving Via Conditional Imitation Learning, 2018, ICRA.
[5] Sergey Levine, et al. Deep Imitative Models for Flexible Inference, Planning, and Control, 2018, ICLR.
[6] Peter Englert, et al. Multi-task policy search for robotics, 2014, ICRA.
[7] Tom Schaul, et al. Deep Q-learning From Demonstrations, 2017, AAAI.
[8] David Warde-Farley, et al. Fast Task Inference with Variational Intrinsic Successor Features, 2019, ICLR.
[9] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[10] Anind K. Dey, et al. Maximum Entropy Inverse Reinforcement Learning, 2008, AAAI.
[11] Kee-Eung Kim, et al. Inverse Reinforcement Learning in Partially Observable Environments, 2009, IJCAI.
[12] John D. Hunter, et al. Matplotlib: A 2D Graphics Environment, 2007, Computing in Science & Engineering.
[13] Matthieu Geist, et al. A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning, 2013, ECML/PKDD.
[14] Olivier Sigaud, et al. Learning compact parameterized skills with a single regression, 2013, Humanoids.
[15] Natasha Jaques, et al. Multi-agent Social Reinforcement Learning Improves Generalization, 2020, ArXiv.
[16] Marlos C. Machado, et al. Eigenoption Discovery through the Deep Successor Representation, 2017, ICLR.
[17] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.
[18] Olivier Pietquin, et al. Observational Learning by Reinforcement Learning, 2017, AAMAS.
[19] Martin A. Riedmiller, et al. Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards, 2017, ArXiv.
[20] Stefano Ermon, et al. Generative Adversarial Imitation Learning, 2016, NIPS.
[21] Tom Schaul, et al. Universal Successor Features Approximators, 2018, ICLR.
[22] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning, 2013, ArXiv.
[23] Sergey Levine, et al. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning, 2017, ICLR.
[24] Richard Tanburn, et al. Making Efficient Use of Demonstrations to Solve Hard Exploration Problems, 2019, ICLR.
[25] Srivatsan Srinivasan, et al. Truly Batch Apprenticeship Learning with Deep Successor Features, 2019, IJCAI.
[26] Christos Dimitrakakis, et al. Bayesian Multitask Inverse Reinforcement Learning, 2011, EWRL.
[27] Pablo Hernandez-Leal, et al. A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity, 2017, ArXiv.
[28] Oliver Kroemer, et al. Learning to select and generalize striking movements in robot table tennis, 2012, AAAI Fall Symposium: Robots Learning Interactively from Human Teachers.
[29] K. Laland. Darwin's Unfinished Symphony: How Culture Made the Human Mind, 2017.
[30] Anca D. Dragan, et al. SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards, 2019, ICLR.
[31] Dirk Helbing, et al. Enhanced intelligent driver model to access the impact of driving strategies on traffic capacity, 2009, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.
[32] J. Schulman, et al. Leveraging Procedural Generation to Benchmark Reinforcement Learning, 2019, ICML.
[33] Yang Gao, et al. Reinforcement Learning from Imperfect Demonstrations, 2018, ICLR.
[34] Sebastian Tschiatschek, et al. Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning, 2018, NeurIPS.
[35] Raymond J. Dolan, et al. Game Theory of Mind, 2008, PLoS Computational Biology.
[36] Sonia Chernova, et al. Integrating reinforcement learning with human demonstrations of varying ability, 2011, AAMAS.
[37] Sergey Levine, et al. Time-Contrastive Networks: Self-Supervised Learning from Video, 2018, ICRA.
[38] Prabhat Nagarajan, et al. Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations, 2019, ICML.
[39] Shane Legg, et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018, ICML.
[40] Stefan Schaal, et al. Robot Learning From Demonstration, 1997, ICML.
[41] J. Henrich. The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter, 2015.
[42] Sergey Levine, et al. Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation, 2018, ICRA.
[43] Andrew Y. Ng, et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping, 1999, ICML.
[44] Tom Schaul, et al. Successor Features for Transfer in Reinforcement Learning, 2016, NIPS.
[45] Peter Dayan, et al. Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993, Neural Computation.
[46] D. Stahl. Evolution of Smartₙ Players, 1991.
[47] Marlos C. Machado, et al. Count-Based Exploration with the Successor Representation, 2018, AAAI.
[48] Sergio Gomez Colmenarejo, et al. Acme: A Research Framework for Distributed Reinforcement Learning, 2020, ArXiv.
[49] Jordan L. Boyd-Graber, et al. Opponent Modeling in Deep Reinforcement Learning, 2016, ICML.
[50] Doina Precup, et al. Fast reinforcement learning with generalized policy updates, 2020, Proceedings of the National Academy of Sciences.
[51] Shimon Whiteson, et al. Inverse Reinforcement Learning from Failure, 2016, AAMAS.
[52] Abhinav Gupta, et al. Multiple Interactions Made Easy (MIME): Large Scale Demonstrations Data for Imitation, 2018, CoRL.
[53] Pieter Abbeel, et al. Apprenticeship learning via inverse reinforcement learning, 2004, ICML.
[54] Siyuan Liu, et al. Robust Bayesian Inverse Reinforcement Learning with Sparse Behavior Noise, 2014, AAAI.
[55] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.
[56] Stephen P. Boyd, et al. Convex Optimization, 2004, Cambridge University Press.
[57] Ion Stoica, et al. Tune: A Research Platform for Distributed Model Selection and Training, 2018, ArXiv.
[58] Nando de Freitas, et al. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning, 2018, ICML.
[59] Sergey Levine, et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation, 2018, CoRL.
[60] Brett Browning, et al. A survey of robot learning from demonstration, 2009, Robotics and Autonomous Systems.
[61] Sergey Levine, et al. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations, 2017, Robotics: Science and Systems.
[62] Ken Goldberg, et al. Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation, 2017, ICRA.
[63] Sergey Levine, et al. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, 2016, ICML.
[64] Rouhollah Rahmatizadeh, et al. Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-to-End Learning from Demonstration, 2018, ICRA.
[65] Matthew E. Taylor, et al. Agent Modeling as Auxiliary Task for Deep Reinforcement Learning, 2019, AIIDE.
[66] Dean Pomerleau, et al. Efficient Training of Artificial Neural Networks for Autonomous Navigation, 1991, Neural Computation.
[67] Arthur C. Graesser, et al. Is it an Agent, or Just a Program?: A Taxonomy for Autonomous Agents, 1996, ATAL.
[68] Peter Stone, et al. Autonomous agents modelling other agents: A comprehensive survey and open problems, 2017, Artificial Intelligence.
[69] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[70] Sergio Gomez Colmenarejo, et al. One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL, 2018, ArXiv.
[71] Sergey Levine, et al. Can Autonomous Vehicles Identify, Recover From, and Adapt to Distribution Shifts?, 2020, ICML.
[72] Songhwai Oh, et al. Robust Learning From Demonstrations With Mixed Qualities Using Leveraged Gaussian Processes, 2019, IEEE Transactions on Robotics.
[73] Markus Wulfmeier, et al. Maximum Entropy Deep Inverse Reinforcement Learning, 2015, ArXiv.
[74] Pieter Abbeel, et al. Learning for control from multiple demonstrations, 2008, ICML.
[75] Peter Stone, et al. Behavioral Cloning from Observation, 2018, IJCAI.
[76] J. Andrew Bagnell, et al. Maximum margin planning, 2006, ICML.
[77] Sergey Levine, et al. A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models, 2016, ArXiv.
[78] Max Jaderberg, et al. Population Based Training of Neural Networks, 2017, ArXiv.
[79] Samuel Gershman, et al. Deep Successor Reinforcement Learning, 2016, ArXiv.
[80] Peter Dayan, et al. Q-learning, 1992, Machine Learning.
[81] Andrew Y. Ng, et al. Algorithms for Inverse Reinforcement Learning, 2000, ICML.
[82] S. Levine, et al. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems, 2020, ArXiv.
[83] Nando de Freitas, et al. Playing hard exploration games by watching YouTube, 2018, NeurIPS.
[84] Marcin Andrychowicz, et al. Overcoming Exploration in Reinforcement Learning with Demonstrations, 2018, ICRA.
[85] Dean Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network, 1988, NIPS.
[86] Aude Billard, et al. Donut as I do: Learning from failed demonstrations, 2011, ICRA.
[87] Tom Heskes, et al. Solving a Huge Number of Similar Tasks: A Combination of Multi-Task Learning and a Hierarchical Bayesian Approach, 1998, ICML.