Paper information - A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback

This paper introduces two novel algorithms for learning behaviors from human-provided rewards. The primary novelty of these algorithms is that, instead of treating the feedback as a numeric reward signal, they interpret feedback as a form of discrete communication that depends on both the behavior the trainer is trying to teach and the teaching strategy used by the trainer. For example, some human trainers use a lack of feedback to indicate whether actions are correct or incorrect, and interpreting this lack of feedback accurately can significantly improve learning speed. Results from user studies show that humans use a variety of training strategies in practice, and that both algorithms learn a contextual bandit task faster than algorithms that treat the feedback as numeric. Simulated trainers are also used to evaluate the algorithms on both contextual bandit and sequential decision-making tasks, with similar results.
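The idea of reading feedback as strategy-dependent communication can be illustrated with a small Bayesian sketch for a single context of a contextual bandit. This is not the paper's implementation: the abstract does not specify the model, so the likelihood below and the names `mu_plus`, `mu_minus`, `eps`, `feedback_likelihood`, and `posterior_over_targets` are assumptions made for illustration. Here `mu_plus` is the assumed probability that the trainer stays silent after an action judged correct, `mu_minus` is the same for an action judged incorrect, and `eps` is an assumed trainer error rate.

```python
from typing import List, Tuple

REWARD, PUNISH, NONE = "reward", "punish", "none"


def feedback_likelihood(feedback: str, correct: bool,
                        mu_plus: float, mu_minus: float, eps: float) -> float:
    """P(feedback | whether the action was correct) under an assumed trainer model."""
    # With probability eps the trainer misjudges the action (assumption).
    p_judged_correct = (1.0 - eps) if correct else eps
    p_judged_wrong = 1.0 - p_judged_correct
    if feedback == REWARD:
        return p_judged_correct * (1.0 - mu_plus)
    if feedback == PUNISH:
        return p_judged_wrong * (1.0 - mu_minus)
    # Silence can follow either judgment, depending on the teaching strategy.
    return p_judged_correct * mu_plus + p_judged_wrong * mu_minus


def posterior_over_targets(history: List[Tuple[int, str]], n_actions: int,
                           mu_plus: float, mu_minus: float,
                           eps: float = 0.1) -> List[float]:
    """Posterior over which action is the target, starting from a uniform prior.

    history holds (action_taken, feedback) pairs observed in one context.
    """
    weights = [1.0] * n_actions
    for action, feedback in history:
        for target in range(n_actions):
            weights[target] *= feedback_likelihood(
                feedback, action == target, mu_plus, mu_minus, eps)
    total = sum(weights)
    return [w / total for w in weights]


# The learner tried action 0 twice and received no feedback either time.
silent_history = [(0, NONE), (0, NONE)]

# Trainer who usually rewards correct actions and stays silent on mistakes:
# silence is evidence that action 0 is wrong.
reward_focused = posterior_over_targets(silent_history, 2, mu_plus=0.1, mu_minus=0.9)

# Trainer who usually punishes mistakes and stays silent on correct actions:
# the same silence is evidence that action 0 is right.
punish_focused = posterior_over_targets(silent_history, 2, mu_plus=0.9, mu_minus=0.1)
```

Under these assumed parameters, identical silent feedback pushes the posterior in opposite directions depending on the trainer's strategy, which is the effect the abstract describes. A learner that treated the missing feedback as a numeric reward of zero would extract no information from either history.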
