Response type selection for chat-like spoken dialog systems based on LSTM and multi-task learning

Abstract: We propose a method for automatically selecting appropriate responses in conversational spoken dialog systems by first explicitly determining the required response type (backchannel, changing the topic, expanding the topic, etc.), based on a comparison of the user's input utterance with many other utterances. Response utterances are then generated according to this response type designation. This allows the generation of more appropriate responses than conventional end-to-end approaches, which use only the user's input to directly generate response utterances. As a response type selector, we propose an LSTM-based encoder–decoder framework that utilizes acoustic and linguistic features extracted from input utterances. To extract these features more accurately, we use not only the input utterances but also the response utterances in the training corpus; to do so, we also investigate multi-task learning with multiple decoders. To evaluate the proposed method, we conducted experiments using a corpus of dialogs between elderly people and an interviewer. The proposed method outperformed conventional methods that use either a point-wise classifier based on Support Vector Machines or a single-task learning LSTM. The best performance was achieved when our two response type selectors (one trained on acoustic features, the other on linguistic features) were combined and multi-task learning was also applied.
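The architecture described above can be sketched as follows. This is a minimal, illustrative numpy implementation, not the authors' code: a shared single-layer LSTM encodes a sequence of per-frame feature vectors into an utterance embedding, a main head predicts the response type, and an auxiliary head (standing in for the paper's response-utterance decoder used during multi-task training) predicts a bag-of-words summary of the system's response. The feature dimensions, hidden size, and the response-type labels are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class LSTMEncoder:
    """Single-layer LSTM that encodes a sequence of feature vectors
    (e.g. per-frame acoustic features or per-word embeddings) into a
    fixed-length utterance vector: its final hidden state."""
    def __init__(self, in_dim, hid_dim):
        self.hid_dim = hid_dim
        # One stacked weight matrix for the input/forget/cell/output gates.
        self.W = rng.normal(0.0, 0.1, (4 * hid_dim, in_dim + hid_dim))
        self.b = np.zeros(4 * hid_dim)

    def __call__(self, xs):
        h = np.zeros(self.hid_dim)
        c = np.zeros(self.hid_dim)
        for x in xs:
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        return h

class MultiTaskSelector:
    """Shared encoder with two heads: the main head classifies the
    response type; the auxiliary head predicts words of the response
    utterance and is used only as an extra training signal."""
    def __init__(self, in_dim, hid_dim, n_types, vocab_size):
        self.enc = LSTMEncoder(in_dim, hid_dim)
        self.W_type = rng.normal(0.0, 0.1, (n_types, hid_dim))
        self.W_resp = rng.normal(0.0, 0.1, (vocab_size, hid_dim))

    def __call__(self, xs):
        h = self.enc(xs)
        return softmax(self.W_type @ h), sigmoid(self.W_resp @ h)

# Hypothetical label set, following the types named in the abstract.
RESPONSE_TYPES = ["backchannel", "change_topic", "expand_topic"]

model = MultiTaskSelector(in_dim=13, hid_dim=32,
                          n_types=len(RESPONSE_TYPES), vocab_size=100)
utterance = rng.normal(size=(50, 13))  # 50 frames of 13-dim features
type_probs, resp_probs = model(utterance)
print(RESPONSE_TYPES[int(type_probs.argmax())])
```

In the paper's setting, two such selectors (one over acoustic features, one over linguistic features) are trained separately and then combined; the sketch above shows only the shared-encoder/multiple-head structure that multi-task learning relies on.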
