Paper information - Learning non-myopically from human-generated reward

Learning non-myopically from human-generated reward

Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward - a myopic agent - or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation [7]. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic. In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to a non-episodic (i.e., continuing) task resolves some theoretical issues present in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call "VI-TAMER", is the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - to (2) learning with the fully synchronous sweeps across all states in the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.
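
The abstract's point about episodic tasks with generally positive reward can be illustrated with a small sketch. It is not the VI-TAMER algorithm and is not taken from the paper. It assumes a constant positive reward per step, an episode that ends once the goal is reached, and a long finite horizon standing in for "never finishing"; the function names `discounted_return` and `prefers_finishing` are hypothetical.

```python
def discounted_return(reward_per_step, gamma, steps):
    """Sum of gamma**t * reward_per_step for t = 0 .. steps - 1."""
    return sum(reward_per_step * gamma ** t for t in range(steps))


def prefers_finishing(reward_per_step, gamma, steps_to_goal, horizon=1000):
    """Compare reaching the goal (episode ends, reward stops) with stalling.

    Assumption for illustration only: stalling earns the same positive
    reward on every step up to `horizon`.
    """
    finish = discounted_return(reward_per_step, gamma, steps_to_goal)
    stall = discounted_return(reward_per_step, gamma, horizon)
    return finish >= stall


if __name__ == "__main__":
    # Myopic valuation (gamma = 0): only the immediate reward counts,
    # so ending the episode costs the agent nothing.
    print(prefers_finishing(1.0, 0.0, steps_to_goal=5))
    # Non-myopic valuation (gamma = 0.99): ending the episode forfeits
    # future positive reward, so stalling has the higher return.
    print(prefers_finishing(1.0, 0.99, steps_to_goal=5))
```

Under these assumptions, a non-myopic agent in an episodic task gains more by avoiding the goal than by reaching it, which is one reading of the theoretical issue the abstract says a continuing formulation resolves.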

Peter Stone | W. Bradley Knox

[1] Peter Stone, et al. Interactively shaping agents via human reinforcement: the TAMER framework, 2009, K-CAP '09.

[2] Farbod Fahimi, et al. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning, 2011, 2011 IEEE International Conference on Rehabilitation Robotics.

[3] Sonia Chernova, et al. Effect of human guidance and state space size on Interactive Reinforcement Learning, 2011, 2011 RO-MAN.

[4] W. Bradley Knox, et al. Learning from human-generated reward, 2012.

[5] David Silver, et al. Achieving Master Level Play in 9 × 9 Computer Go, 2008, Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence.

[6] Peter Stone, et al. Cobot in LambdaMOO: An Adaptive Social Statistics Agent, 2006, Autonomous Agents and Multi-Agent Systems.

[7] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[8] Leopoldo Altamirano Robles, et al. Teaching a Robot to Perform Task through Imitation and On-line Feedback, 2011, CIARP.

[9] Peter Stone, et al. Reinforcement learning from human reward: Discounting in episodic tasks, 2012, 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication.

[10] Andrea Lockerd Thomaz, et al. Teachable robots: Understanding human teaching behavior to build more effective robot learners, 2008, Artif. Intell.

[11] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[12] Csaba Szepesvári, et al. Bandit Based Monte-Carlo Planning, 2006, ECML.

[13] Luis Alvarez, et al. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2012, Lecture Notes in Computer Science.

[14] Neil D. Lawrence, et al. Missing Data in Kernel PCA, 2006, ECML.

[15] Peter Stone, et al. Learning and Using Models, 2012, Reinforcement Learning.

[16] Brett Browning, et al. A survey of robot learning from demonstration, 2009, Robotics Auton. Syst.

[17] Eduardo F. Morales, et al. Dynamic Reward Shaping: Training a Robot by Voice, 2010, IBERAMIA.