Hyperparameter Selection for Imitation Learning

We address the problem of tuning hyperparameters (HPs) for imitation learning algorithms in continuous control, where the underlying reward function of the demonstrating expert can never be observed. The vast imitation learning literature mostly assumes this reward function is available for HP selection, but this is not a realistic setting: were the reward function available, it could be used directly for policy training and imitation would be unnecessary. To tackle this largely ignored problem, we propose several proxies for the external reward. We evaluate them in an extensive empirical study (more than 10,000 agents across 9 environments) and make practical recommendations for selecting HPs. Our results show that while imitation learning algorithms are sensitive to HP choices, it is often possible to select good enough HPs through a proxy to the reward function.
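As a rough sketch of how reward-free HP selection could look in practice, the snippet below ranks candidate configurations by a simple proxy computed only from expert data: the action-matching error on held-out demonstrations. The proxy choice and every name here (`train_agent`, `action_matching_error`, `select_hyperparameters`) are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def action_matching_error(policy, heldout_demos):
    """Mean squared error between the policy's actions and the expert's
    actions on held-out demonstration states (no environment reward used).
    heldout_demos: iterable of (states, expert_actions) arrays."""
    errors = []
    for states, expert_actions in heldout_demos:
        predicted = np.stack([policy(s) for s in states])
        errors.append(np.mean((predicted - expert_actions) ** 2))
    return float(np.mean(errors))

def select_hyperparameters(train_agent, hp_candidates, train_demos, heldout_demos):
    """Train one imitation agent per HP configuration and keep the one with
    the best (lowest) proxy score; the true reward is never queried."""
    scored = []
    for hps in hp_candidates:
        policy = train_agent(train_demos, **hps)  # hypothetical reward-free training call
        proxy = action_matching_error(policy, heldout_demos)
        scored.append((proxy, hps, policy))
    best_proxy, best_hps, best_policy = min(scored, key=lambda t: t[0])
    return best_hps, best_policy, best_proxy
```

Other proxies (e.g., a divergence between the states visited by the learned policy and those in the demonstrations) could be swapped in for `action_matching_error` without changing the selection loop.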
