Exploring Textual and Speech Information in Dialogue Act Classification with Speaker Domain Adaptation

Despite the recent success of Dialogue Act (DA) classification, most prior work focuses on text-based classification with oracle transcriptions, i.e., human transcriptions, rather than Automatic Speech Recognition (ASR) transcriptions. In a spoken dialog system, however, the agent has access only to noisy ASR transcriptions, which may further suffer performance degradation due to domain shift. In this paper, we explore the effectiveness of using both acoustic and textual signals, from either oracle or ASR transcriptions, and investigate speaker domain adaptation for DA classification. Our multimodal model proves superior to the unimodal models, particularly when oracle transcriptions are not available. We also propose an effective method for speaker domain adaptation, which achieves competitive results.
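To make the multimodal setup concrete, the sketch below shows one common way such a model can be structured: utterance-level text and acoustic embeddings are fused by concatenation and passed to a linear DA classifier. This is a minimal illustrative sketch, not the paper's architecture; the dimensions, the randomly initialized weights, and the late-fusion choice are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
TEXT_DIM, AUDIO_DIM, NUM_DA_LABELS = 16, 8, 4

# Randomly initialized fusion-layer parameters (stand-ins for trained weights).
W = rng.normal(size=(TEXT_DIM + AUDIO_DIM, NUM_DA_LABELS))
b = np.zeros(NUM_DA_LABELS)

def softmax(z):
    # Numerically stable softmax over the label axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_da(text_vec, audio_vec):
    """Late fusion: concatenate the textual and acoustic utterance
    embeddings, then score the dialogue-act labels with a linear layer."""
    fused = np.concatenate([text_vec, audio_vec])
    return softmax(fused @ W + b)

# Example: random embeddings standing in for encoder outputs
# (e.g. an RNN over ASR tokens and a frame-pooled acoustic encoder).
probs = classify_da(rng.normal(size=TEXT_DIM), rng.normal(size=AUDIO_DIM))
print(probs.shape)  # (4,)
```

In practice the two embeddings would come from trained encoders over the (oracle or ASR) transcription and the audio signal; concatenation is only the simplest of several possible fusion strategies.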
