Learning Rewards from Linguistic Feedback

We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g., commands). We propose a general framework that does not make this assumption. We decompose linguistic feedback into two components: a grounding to $\textit{features}$ of a Markov decision process and $\textit{sentiment}$ about those features. We then perform an analogue of inverse reinforcement learning, regressing the teacher's sentiment on the grounded features to infer their latent reward function. To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. Using our framework, we implement two artificial learners: a simple "literal" model and a "pragmatic" model with additional inductive biases. We compare these against a baseline neural network trained end-to-end to predict latent rewards. We then repeat our initial experiment, pairing human teachers with our models. We find that our "literal" and "pragmatic" models successfully learn from live human feedback and offer statistically significant performance gains over the end-to-end baseline, with the "pragmatic" model approaching human performance on the task. Inspection reveals that the end-to-end network learns representations similar to those of our models, suggesting they reflect emergent properties of the data. Our work thus provides insight into the information structure of naturalistic linguistic feedback as well as methods for leveraging it in reinforcement learning.
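
To make the feature-and-sentiment decomposition concrete, the following is a minimal illustrative sketch, not the paper's implementation: the function names, the toy feature groundings, and the least-squares estimator are our own assumptions. Each utterance is grounded to a feature vector of the MDP and scored for sentiment, and the teacher's latent reward weights are then estimated by regressing sentiment on those features.

```python
import numpy as np

# Illustrative sketch (not the paper's code): infer latent reward weights by
# regressing utterance sentiment on the MDP features each utterance grounds to.

def infer_reward_weights(features, sentiment):
    """Least-squares estimate of w such that sentiment ~= features @ w.

    features:  (n_utterances, n_features) grounding of each utterance
    sentiment: (n_utterances,) signed sentiment score per utterance
    """
    w, *_ = np.linalg.lstsq(features, sentiment, rcond=None)
    return w

def reward(state_features, w):
    """Linear reward over state features under the inferred weights."""
    return state_features @ w

# Toy example: three utterances grounded to two hypothetical features.
phi = np.array([[1.0, 0.0],   # e.g. "nice job collecting the blue ones"
                [0.0, 1.0],   # e.g. "stop picking up the red ones"
                [1.0, 1.0]])  # an utterance referencing both features
s = np.array([+1.0, -1.0, 0.2])

w_hat = infer_reward_weights(phi, s)
print(w_hat)  # positive weight on the praised feature, negative on the criticized one
```

Under this sketch, a "literal" learner would simply act to maximize the inferred reward `reward(state_features, w_hat)`; the "pragmatic" model described in the abstract would add further inductive biases on top of this regression.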
