A Hierarchical Model for Dialog Act Recognition Considering Acoustic and Lexical Context Information

Dialog act recognition (DAR) is essential for capturing speakers’ intentions in a dialog system. Traditional methods commonly perform DAR using lexical information from transcripts, acoustic information from speech, and dialog context information. However, these methods may take textual context into account while ignoring acoustic context, which leaves certain dialog acts (DAs) ambiguous, especially in Mandarin. To solve this problem, we propose a hierarchical model for DAR that considers both lexical and acoustic-prosodic context information. Experimental results on a Mandarin dialog corpus demonstrate that contextual acoustic information helps in recognizing DAs. The context-specific prosody of utterances such as echo questions and open-ended questions helps identify the users’ intentions. We also investigate the effect of context length on DAR. The appropriate context length is approximately equal to the length of an entire subtopic.
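The two-level idea described above can be illustrated with a minimal sketch. It is not the model proposed in the paper: mean pooling stands in for whatever learned encoders are actually used, and the function names (`encode_utterance`, `encode_with_context`) and the `context_len` parameter are hypothetical, introduced only for illustration. The lower level turns each utterance into one vector with a lexical part and an acoustic part; the upper level appends a summary of the preceding utterances, so that both lexical and acoustic context reach the classifier.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
# One utterance = (word vectors, acoustic frame vectors); both are assumptions.
Utterance = Tuple[Sequence[Sequence[float]], Sequence[Sequence[float]]]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def encode_utterance(word_vecs: Sequence[Sequence[float]],
                     frame_vecs: Sequence[Sequence[float]]) -> Vector:
    """Lower level: pooled lexical features followed by pooled acoustic features."""
    return mean_pool(word_vecs) + mean_pool(frame_vecs)


def encode_with_context(dialog: Sequence[Utterance],
                        index: int,
                        context_len: int) -> Vector:
    """Upper level: current utterance vector followed by pooled context.

    The context covers up to `context_len` preceding utterances and carries
    both their lexical and their acoustic parts. With no history available,
    a zero vector of the same size is used instead.
    """
    current = encode_utterance(*dialog[index])
    start = max(0, index - context_len)
    history = [encode_utterance(*u) for u in dialog[start:index]]
    context = mean_pool(history) if history else [0.0] * len(current)
    return current + context
```

In this sketch, `context_len` plays the role of the context length studied in the paper; the resulting vector would be passed to any DA classifier.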
