Learning Goal-oriented Dialogue Policy with opposite Agent Awareness

Most existing approaches for goal-oriented dialogue policy learning used reinforcement learning, which focuses on the target agent policy and simply treat the opposite agent policy as part of the environment. While in real-world scenarios, the behavior of an opposite agent often exhibits certain patterns or underlies hidden policies, which can be inferred and utilized by the target agent to facilitate its own decision making. This strategy is common in human mental simulation by first imaging a specific action and the probable results before really acting it. We therefore propose an opposite behavior aware framework for policy learning in goal-oriented dialogues. We estimate the opposite agent's policy from its behavior and use this estimation to improve the target agent by regarding it as part of the target policy. We evaluate our model on both cooperative and competitive dialogue tasks, showing superior performance over state-of-the-art baselines.

[1]  David Vandyke,et al.  Continuously Learning Neural Dialogue Management , 2016, ArXiv.

[2]  Mary P. Harper,et al.  Learning from 26 Languages: Program Management and Science in the Babel Program , 2014, COLING.

[3]  Milica Gasic,et al.  POMDP-Based Statistical Spoken Dialog Systems: A Review , 2013, Proceedings of the IEEE.

[4]  Steve J. Young,et al.  Partially observable Markov decision processes for spoken dialog systems , 2007, Comput. Speech Lang..

[5]  Kam-Fai Wong,et al.  Integrating planning for task-completion dialogue policy learning , 2018, ACL.

[6]  Yiming Yang,et al.  Switch-based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning , 2018, AAAI.

[7]  Gökhan Tür,et al.  User Modeling for Task Oriented Dialogues , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[8]  Enhong Chen,et al.  Budgeted Policy Learning for Task-Oriented Dialogue Systems , 2019, ACL.

[9]  Mike Lewis,et al.  Hierarchical Text Generation and Planning for Strategic Dialogue , 2017, ICML.

[10]  Jianfeng Gao,et al.  End-to-End Task-Completion Neural Dialogue Systems , 2017, IJCNLP.

[11]  Seunghak Yu,et al.  Scaling up deep reinforcement learning for multi-domain dialogue systems , 2017, 2017 International Joint Conference on Neural Networks (IJCNN).

[12]  Razvan Pascanu,et al.  Imagination-Augmented Agents for Deep Reinforcement Learning , 2017, NIPS.

[13]  R. Gordon Folk Psychology as Simulation , 1986 .

[14]  Hui Ye,et al.  Agenda-Based User Simulation for Bootstrapping a POMDP Dialogue System , 2007, NAACL.

[15]  M. Tomasello,et al.  Does the chimpanzee have a theory of mind? 30 years later , 2008, Trends in Cognitive Sciences.

[16]  Maxine Eskénazi,et al.  Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning , 2016, SIGDIAL Conference.

[17]  Hao Tian,et al.  Policy Learning for Domain Selection in an Extensible Multi-domain Spoken Dialogue System , 2014, EMNLP.

[18]  Dilek Z. Hakkani-Tür,et al.  Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems , 2018, NAACL.

[19]  Alec Radford,et al.  Proximal Policy Optimization Algorithms , 2017, ArXiv.

[20]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[21]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[22]  Jianfeng Gao,et al.  A User Simulator for Task-Completion Dialogues , 2016, ArXiv.

[23]  Steve J. Young,et al.  A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies , 2006, The Knowledge Engineering Review.

[24]  Yang Feng,et al.  Bridging the Gap between Training and Inference for Neural Machine Translation , 2019, ACL.

[25]  Kam-Fai Wong,et al.  Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning , 2017, EMNLP.

[26]  Oliver Lemon,et al.  Reinforcement Learning for Adaptive Dialogue Systems - A Data-driven Methodology for Dialogue Management and Natural Language Generation , 2011, Theory and Applications of Natural Language Processing.

[27]  Roberto Pieraccini,et al.  Learning dialogue strategies within the Markov decision process framework , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[28]  YoungSteve,et al.  A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies , 2006 .

[29]  Yann Dauphin,et al.  Deal or No Deal? End-to-End Learning of Negotiation Dialogues , 2017, EMNLP.

[30]  Iñigo Casanueva,et al.  Neural User Simulation for Corpus-based Policy Optimisation of Spoken Dialogue Systems , 2018, SIGDIAL Conference.

[31]  Geoffrey Zweig,et al.  Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning , 2017, ACL.

[32]  Minlie Huang,et al.  Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog , 2019, EMNLP.

[33]  Jing He,et al.  A Sequence-to-Sequence Model for User Simulation in Spoken Dialogue Systems , 2016, INTERSPEECH.

[34]  Jianfeng Gao,et al.  Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning , 2018, EMNLP.

[35]  Jianfeng Gao,et al.  Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access , 2016, ACL.

[36]  Pararth Shah,et al.  Multi-Action Dialog Policy Learning with Interactive Human Teaching , 2020, SIGDIAL.

[37]  Lihong Li,et al.  Neural Approaches to Conversational AI , 2019, Found. Trends Inf. Retr..

[38]  Jing He,et al.  Policy Networks with Two-Stage Training for Dialogue Systems , 2016, SIGDIAL Conference.

[39]  A. Goldman,et al.  Mirror neurons and the simulation theory of mind-reading , 1998, Trends in Cognitive Sciences.

[40]  Stefan Ultes,et al.  Feudal Reinforcement Learning for Dialogue Management in Large Domains , 2018, NAACL.

[41]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[42]  David Vandyke,et al.  Policy committee for adaptation in multi-domain spoken dialogue systems , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[43]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[44]  Stefan Ultes,et al.  MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling , 2018, EMNLP.

[45]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[46]  Minlie Huang,et al.  Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition , 2020, ACL.

[47]  Richard S. Sutton,et al.  Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming , 1990, ML.

[48]  Bing Liu,et al.  Iterative policy learning in end-to-end trainable task-oriented neural dialog models , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[49]  Jianfeng Gao,et al.  BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems , 2016, AAAI.