Hierarchical Reinforcement Learning With Guidance for Multi-Domain Dialogue Policy

Achieving high performance with low computation in a multi-domain dialogue system is undoubtedly challenging. Previous works applying end-to-end approaches have been very successful, but their computational cost remains a major issue because a large pretrained language model such as GPT-2 is required. Meanwhile, optimizing individual components of the dialogue system has not shown promising results, especially for the dialogue management component, owing to the complexity of multi-domain state and action representations. To cope with these issues, this article presents an efficient guidance learning scheme in which imitation learning and hierarchical reinforcement learning (HRL) with a human in the loop are performed to achieve high performance with an inexpensive dialogue agent. Behavior cloning with auxiliary tasks is exploited to identify the important features in the latent representation. In particular, the proposed HRL assigns each goal of a dialogue to a corresponding sub-policy, enabling efficient dialogue policy learning that exploits human guidance, through action pruning and action evaluation, as well as the reward obtained from interaction with the simulated user in the environment. Experimental results on the ConvLab-2 framework show that the proposed method achieves state-of-the-art performance in dialogue policy optimization and outperforms GPT-2-based solutions in end-to-end system evaluation.
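The core idea of the abstract, one sub-policy per dialogue goal (domain) with human guidance applied as action pruning and scalar action evaluation, can be illustrated with a minimal sketch. This is not the paper's implementation: the class name, tabular preferences, and greedy selection are illustrative assumptions standing in for the learned neural sub-policies.

```python
class GuidedHierarchicalPolicy:
    """Minimal sketch of guided HRL for dialogue management:
    one sub-policy per domain, with human guidance applied as
    (i) an action-pruning mask and (ii) a scalar action evaluation.
    All names and the tabular scoring scheme are illustrative assumptions.
    """

    def __init__(self, domains, actions_per_domain):
        # Each domain (dialogue goal) gets its own table of action
        # preferences, standing in for a learned sub-policy.
        self.sub_policies = {
            d: {a: 0.0 for a in actions_per_domain[d]} for d in domains
        }

    def select_action(self, domain, pruned_actions=()):
        # Guidance step 1: action pruning removes actions a human
        # (or a rule-based checker) marked as invalid in this state.
        prefs = self.sub_policies[domain]
        allowed = {a: v for a, v in prefs.items()
                   if a not in set(pruned_actions)}
        # Greedy choice among the remaining actions of this sub-policy.
        return max(allowed, key=allowed.get)

    def update(self, domain, action, evaluation, lr=0.1):
        # Guidance step 2: a scalar action evaluation (or the reward
        # from the simulated user) nudges the sub-policy's preference.
        self.sub_policies[domain][action] += lr * evaluation
```

A usage example: after a positive evaluation of `book` in the `hotel` domain, the hotel sub-policy prefers it, unless a guidance mask prunes it, in which case selection falls back to the remaining actions.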
