Human Preference Scaling with Demonstrations for Deep Reinforcement Learning

Reward learning from human preferences can solve complex reinforcement learning (RL) tasks without access to a reward function by eliciting a single fixed preference between pairs of trajectory segments. However, such preference judgements are not dynamic, and existing methods still require more than 1,000 human inputs. In this study, we propose a human preference scaling model that naturally reflects how strongly humans prefer one trajectory over another, and we then develop a human-demonstration preference model via supervised learning to reduce the number of required human inputs. The proposed human preference scaling model with demonstrations effectively solves complex RL tasks and achieves higher cumulative rewards on simulated robot locomotion tasks (MuJoCo) than single fixed human preferences. Furthermore, our human-demonstration preference model needs human feedback on less than 0.01% of the agent's interactions with the environment and reduces the cost of human inputs by up to 30% compared with existing approaches. To illustrate the flexibility of our approach, we released a video (this https URL) comparing the behaviours of agents trained with different types of human input. We believe that our naturally inspired human preference scaling with demonstrations benefits precise reward learning and can potentially be applied to state-of-the-art RL systems, such as autonomous driving systems.
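Below is a minimal sketch of the kind of preference-scaled reward learning the abstract describes, assuming a Bradley-Terry style loss in the spirit of Christiano et al. (2017), where a scaled label in [0, 1] replaces the single fixed binary preference between two trajectory segments. All names here (RewardModel, scaled_preference_loss, pref_scale) are illustrative assumptions, not the paper's implementation.

```python
# Sketch of reward learning from *scaled* human preferences (assumed formulation).
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def scaled_preference_loss(reward_model, seg1, seg2, pref_scale):
    """Cross-entropy between the predicted preference probability and a scaled
    human label in [0, 1] (0.5 = indifferent, 1.0 = strongly prefers segment 1),
    instead of a single fixed binary preference.

    seg1, seg2: dicts with "obs" and "act" tensors of shape (batch, T, dim).
    pref_scale: tensor of shape (batch,) with the scaled preference labels.
    """
    # Sum predicted per-step rewards over each trajectory segment: (batch, T) -> (batch,)
    r1 = reward_model(seg1["obs"], seg1["act"]).sum(dim=-1)
    r2 = reward_model(seg2["obs"], seg2["act"]).sum(dim=-1)
    # Bradley-Terry probability that segment 1 is preferred over segment 2.
    p1 = torch.sigmoid(r1 - r2)
    return -(pref_scale * torch.log(p1 + 1e-8)
             + (1.0 - pref_scale) * torch.log(1.0 - p1 + 1e-8)).mean()
```

In the same spirit, the human-demonstration preference model could be approximated by pre-training the reward model on demonstration segments treated as preferred (pref_scale near 1.0) via supervised learning, so fewer live human queries are needed; this, too, is an assumption rather than the paper's exact procedure.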
