Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization

The problem of inverse reinforcement learning (IRL) is relevant to a variety of tasks, including value alignment and robot learning from demonstration. Despite significant algorithmic contributions in recent years, IRL remains an ill-posed problem at its core: multiple reward functions are consistent with the observed behavior, and the actual reward function is not identifiable without prior knowledge or supplementary information. This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL), which identifies multiple solutions consistent with the expert demonstrations by efficiently exploring the reward function space. BO-IRL achieves this by utilizing Bayesian optimization along with our newly proposed kernel that (a) projects the parameters of policy-invariant reward functions to a single point in a latent space and (b) ensures that nearby points in the latent space correspond to reward functions yielding similar likelihoods. This projection allows the use of standard stationary kernels in the latent space to capture the correlations present across the reward function space. Empirical results on synthetic and real-world environments (model-free and model-based) show that BO-IRL discovers multiple reward functions while minimizing the number of expensive exact policy optimizations.
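To make the workflow above concrete, below is a minimal, self-contained sketch of a Bayesian optimization loop over a one-dimensional reward parameter. The projection rho(), the toy objective irl_log_likelihood(), and all hyperparameters are illustrative assumptions, not the paper's actual projection kernel, likelihood, or environments; the sketch only shows how a stationary RBF kernel applied in a projected latent space can drive the search while keeping the number of expensive likelihood evaluations (each of which would require a policy optimization) small.

```python
# Minimal sketch of a BO-over-reward-parameters loop (illustrative assumptions only).
import numpy as np
from scipy.stats import norm

def rho(w):
    """Hypothetical projection into a latent space where policy-invariant
    reward parameters would collapse to the same point (toy nonlinearity here)."""
    return np.tanh(w)

def rbf_kernel(A, B, lengthscale=0.5):
    """Stationary RBF kernel applied in the latent space rho(.)."""
    d = rho(A)[:, None] - rho(B)[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def irl_log_likelihood(w):
    """Placeholder for the expensive step: optimize a policy for reward(w)
    and score the expert demonstrations under it (toy quadratic stand-in)."""
    return -(w - 1.2) ** 2

def gp_posterior(X, y, Xs, noise=1e-4):
    """GP posterior mean and std at candidate points Xs given observations (X, y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    Kss = rbf_kernel(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(np.diag(Kss) - np.sum(v ** 2, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Standard EI acquisition for maximization."""
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=3)                 # initial reward parameters
y = np.array([irl_log_likelihood(w) for w in X])
candidates = np.linspace(-3, 3, 200)

for _ in range(10):                            # each iteration = one expensive evaluation
    mu, sigma = gp_posterior(X, y, candidates)
    w_next = candidates[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.append(X, w_next)
    y = np.append(y, irl_log_likelihood(w_next))

# Report all explored reward parameters whose likelihood is close to the best found,
# i.e. a (toy) set of multiple consistent reward functions.
print("High-likelihood reward parameters:", np.sort(X[y >= y.max() - 0.1]))
```

The point of the surrogate model here is that the acquisition function decides which reward parameters are worth the cost of an exact policy optimization, and the posterior over the latent space exposes all regions of high likelihood rather than a single point estimate.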
