Hierarchical Reinforcement Learning With Guidance for Multi-Domain Dialogue Policy

Achieving high performance with low computation in a multi-domain dialogue system is undoubtedly challenging. Previous works applying end-to-end approaches have been very successful, but their computational cost remains a major issue because a large pretrained language model such as GPT-2 is required. Meanwhile, optimizing individual components of the dialogue system has not shown promising results, especially for the dialogue management component, owing to the complexity of multi-domain state and action representations. To cope with these issues, this article presents an efficient guidance learning scheme in which imitation learning and hierarchical reinforcement learning (HRL) with a human in the loop are performed to achieve high performance with an inexpensive dialogue agent. Behavior cloning with auxiliary tasks is exploited to identify the important features in the latent representation. In particular, the proposed HRL assigns each goal of a dialogue to a corresponding sub-policy, enabling efficient dialogue policy learning that exploits human guidance, through action pruning and action evaluation, as well as the reward obtained from interaction with the simulated user in the environment. Experimental results on the ConvLab-2 framework show that the proposed method achieves state-of-the-art performance in dialogue policy optimization and outperforms GPT-2-based solutions in end-to-end system evaluation.
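The core idea of the abstract, one sub-policy per dialogue goal (domain) with human guidance applied as action pruning and scalar action evaluation, can be illustrated with a minimal sketch. This is not the paper's implementation: the class name, tabular preferences, and greedy selection are illustrative assumptions standing in for the learned neural sub-policies.

```python
class GuidedHierarchicalPolicy:
    """Minimal sketch of guided HRL for dialogue management:
    one sub-policy per domain, with human guidance applied as
    (i) an action-pruning mask and (ii) a scalar action evaluation.
    All names and the tabular scoring scheme are illustrative assumptions.
    """

    def __init__(self, domains, actions_per_domain):
        # Each domain (dialogue goal) gets its own table of action
        # preferences, standing in for a learned sub-policy.
        self.sub_policies = {
            d: {a: 0.0 for a in actions_per_domain[d]} for d in domains
        }

    def select_action(self, domain, pruned_actions=()):
        # Guidance step 1: action pruning removes actions a human
        # (or a rule-based checker) marked as invalid in this state.
        prefs = self.sub_policies[domain]
        allowed = {a: v for a, v in prefs.items()
                   if a not in set(pruned_actions)}
        # Greedy choice among the remaining actions of this sub-policy.
        return max(allowed, key=allowed.get)

    def update(self, domain, action, evaluation, lr=0.1):
        # Guidance step 2: a scalar action evaluation (or the reward
        # from the simulated user) nudges the sub-policy's preference.
        self.sub_policies[domain][action] += lr * evaluation
```

A usage example: after a positive evaluation of `book` in the `hotel` domain, the hotel sub-policy prefers it, unless a guidance mask prunes it, in which case selection falls back to the remaining actions.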
