Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.
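
The following is a minimal sketch, not the authors' released code, of the two ideas named in the abstract: KL-control, which penalizes the learned policy for diverging from a pre-trained prior during RL training, and a dropout-based lower bound on target Q-values used in place of Double Q-Learning. The discount `gamma`, KL weight `alpha`, sample count `num_dropout_samples`, and all function and tensor names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of KL-control + dropout-based target lower bounding.
# Assumes a discrete action space and a target Q-network with dropout layers.
import torch
import torch.nn.functional as F


def kl_control_dropout_target(q_target_net, prior_logits, q_logits,
                              rewards, next_states, dones,
                              gamma=0.99, alpha=0.1, num_dropout_samples=10):
    """Compute a KL-regularized, uncertainty-penalized Q-learning target.

    q_target_net -- target Q-network containing dropout layers
    prior_logits -- logits of the pre-trained prior p(a|s')  [batch, actions]
    q_logits     -- logits defining the policy pi(a|s')      [batch, actions]
    rewards, dones, next_states -- tensors from the fixed batch of data
    """
    with torch.no_grad():
        # Dropout-based lower bound: keep dropout active and take the
        # elementwise minimum over several stochastic forward passes, so
        # uncertain state-action pairs receive pessimistic targets.
        q_target_net.train()  # keep sampling dropout masks
        samples = torch.stack([q_target_net(next_states)
                               for _ in range(num_dropout_samples)], dim=0)
        q_next = samples.min(dim=0).values            # [batch, actions]

        # KL-control: subtract alpha * KL(pi || prior) over next actions,
        # rewarding the policy for staying close to the pre-trained prior.
        log_pi = F.log_softmax(q_logits, dim=-1)
        log_prior = F.log_softmax(prior_logits, dim=-1)
        pi = log_pi.exp()
        kl = (pi * (log_pi - log_prior)).sum(dim=-1)  # [batch]

        expected_q = (pi * q_next).sum(dim=-1)        # expectation under pi
        target = rewards + gamma * (1.0 - dones) * (expected_q - alpha * kl)
    return target
```

The current Q-network would then be regressed toward `target` with a standard TD loss; the pessimistic minimum over dropout samples plays the role that the second network plays in Double Q-Learning.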
