Learning non-myopically from human-generated reward

Recent research has demonstrated that human-generated reward signals can be used effectively to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward - a myopic agent - or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation [7]. That study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic. In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to a non-episodic (i.e., continuing) task resolves some theoretical issues present in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call "VI-TAMER", is the first algorithm to successfully learn non-myopically from human-generated reward; we also show empirically that such non-myopic valuation facilitates a higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - to (2) learning with the fully synchronous sweeps across all states in the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.
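To make the algorithmic contrast concrete, the sketch below (not the authors' implementation; the paper includes no code) shows value iteration over a learned model of human reward in the spirit of VI-TAMER, run once with a myopic discount (gamma = 0) and once with a non-myopic discount (gamma = 0.99). The grid size, the random transition table T, the reward model H, and the discount values are all illustrative assumptions rather than details taken from the study.

# Minimal sketch, not the authors' implementation: value iteration over a
# learned model of human reward, contrasting myopic and non-myopic discounting.
# The task size, transitions T, and reward model H are illustrative assumptions.
import numpy as np

n_states, n_actions = 25, 4            # hypothetical 5x5 grid task
rng = np.random.default_rng(0)
T = rng.integers(0, n_states, size=(n_states, n_actions))   # T[s, a] -> next state
H = rng.normal(0.5, 1.0, size=(n_states, n_actions))        # predicted human reward (mostly positive)

def value_iteration(gamma, tol=1e-6, max_iters=10_000):
    """Fully synchronous sweeps over all states, as in the VI-TAMER condition."""
    V = np.zeros(n_states)
    for _ in range(max_iters):
        Q = H + gamma * V[T]               # Q[s, a] = H(s, a) + gamma * V(T[s, a])
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:  # stop once a sweep barely changes any value
            break
        V = V_new
    return V, Q.argmax(axis=1)

# A myopic agent (gamma = 0) acts greedily on predicted human reward alone;
# a non-myopic agent (gamma = 0.99) also credits reward far in the future.
_, policy_myopic = value_iteration(gamma=0.0)
_, policy_nonmyopic = value_iteration(gamma=0.99)

The locally biased, asynchronous variant compared in the later user studies would replace the full sweep with updates concentrated on states quickly reachable from the agent's current state.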

[1] Peter Stone, et al. Interactively shaping agents via human reinforcement: the TAMER framework, 2009, K-CAP '09.

[2] Farbod Fahimi, et al. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning, 2011, 2011 IEEE International Conference on Rehabilitation Robotics.

[3] Sonia Chernova, et al. Effect of human guidance and state space size on Interactive Reinforcement Learning, 2011, 2011 RO-MAN.

[4] W. Bradley Knox, et al. Learning from human-generated reward, 2012.

[5] David Silver, et al. Achieving Master Level Play in 9 × 9 Computer Go, 2008, Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence.

[6] Peter Stone, et al. Cobot in LambdaMOO: An Adaptive Social Statistics Agent, 2006, Autonomous Agents and Multi-Agent Systems.

[7] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[8] Leopoldo Altamirano Robles, et al. Teaching a Robot to Perform Task through Imitation and On-line Feedback, 2011, CIARP.

[9] Peter Stone, et al. Reinforcement learning from human reward: Discounting in episodic tasks, 2012, 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication.

[10] Andrea Lockerd Thomaz, et al. Teachable robots: Understanding human teaching behavior to build more effective robot learners, 2008, Artif. Intell.

[11] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[12] Csaba Szepesvári, et al. Bandit Based Monte-Carlo Planning, 2006, ECML.

[13] Luis Alvarez, et al. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2012, Lecture Notes in Computer Science.

[14] Neil D. Lawrence, et al. Missing Data in Kernel PCA, 2006, ECML.

[15] Peter Stone, et al. Learning and Using Models, 2012, Reinforcement Learning.

[16] Brett Browning, et al. A survey of robot learning from demonstration, 2009, Robotics Auton. Syst.

[17] Eduardo F. Morales, et al. Dynamic Reward Shaping: Training a Robot by Voice, 2010, IBERAMIA.