Learning Rewards from Linguistic Feedback

We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g., commands). We propose a general framework that does not make this assumption. We decompose linguistic feedback into two components: a grounding to $\textit{features}$ of a Markov decision process and $\textit{sentiment}$ about those features. We then perform an analogue of inverse reinforcement learning, regressing the teacher's sentiment on the grounded features to infer their latent reward function. To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. Using our framework, we implement two artificial learners: a simple "literal" model and a "pragmatic" model with additional inductive biases. We compare these against a baseline neural network trained end-to-end to predict latent rewards. We then repeat our initial experiment, pairing human teachers with our models. We find that our "literal" and "pragmatic" models successfully learn from live human feedback and offer statistically significant performance gains over the end-to-end baseline, with the "pragmatic" model approaching human performance on the task. Inspection reveals that the end-to-end network learns representations similar to those of our models, suggesting they reflect emergent properties of the data. Our work thus provides insight into the information structure of naturalistic linguistic feedback as well as methods for leveraging it in reinforcement learning.
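
To make the feature-and-sentiment decomposition concrete, the following is a minimal illustrative sketch, not the paper's implementation: the function names, the toy feature groundings, and the least-squares estimator are our own assumptions. Each utterance is grounded to a feature vector of the MDP and scored for sentiment, and the teacher's latent reward weights are then estimated by regressing sentiment on those features.

```python
import numpy as np

# Illustrative sketch (not the paper's code): infer latent reward weights by
# regressing utterance sentiment on the MDP features each utterance grounds to.

def infer_reward_weights(features, sentiment):
    """Least-squares estimate of w such that sentiment ~= features @ w.

    features:  (n_utterances, n_features) grounding of each utterance
    sentiment: (n_utterances,) signed sentiment score per utterance
    """
    w, *_ = np.linalg.lstsq(features, sentiment, rcond=None)
    return w

def reward(state_features, w):
    """Linear reward over state features under the inferred weights."""
    return state_features @ w

# Toy example: three utterances grounded to two hypothetical features.
phi = np.array([[1.0, 0.0],   # e.g. "nice job collecting the blue ones"
                [0.0, 1.0],   # e.g. "stop picking up the red ones"
                [1.0, 1.0]])  # an utterance referencing both features
s = np.array([+1.0, -1.0, 0.2])

w_hat = infer_reward_weights(phi, s)
print(w_hat)  # positive weight on the praised feature, negative on the criticized one
```

Under this sketch, a "literal" learner would simply act to maximize the inferred reward `reward(state_features, w_hat)`; the "pragmatic" model described in the abstract would add further inductive biases on top of this regression.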
