A Neural Turn-Taking Model without RNN

Sequential data such as speech and dialogs are usually modeled by Recurrent Neural Networks (RNN) and derivatives since the information can travel through time with such architecture. However, disadvantages exist with the use of RNNs, including the limited depth of neural networks and the GPU’s unfriendly training process. Estimating the timing of turn-taking is a critical feature of dialog systems. Such tasks require knowledge about past dialog contexts and have been modeled using RNNs in several studies. In this paper, we propose a non-RNN model for the timing estimation of turn-taking in dialogs. The proposed model takes lexical and acoustic features as its input to predict a turn’s end. We conducted experiments on four types of Japanese conversation datasets and show that with proper neural network designs, the long-term information in a dialog could propagate without a recurrent structure. The proposed model outperformed canonical RNN-based architectures on a turn-taking estimation task.

[1]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[2]  E. Schegloff,et al.  A simplest systematics for the organization of turn-taking for conversation , 2015 .

[3]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[4]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[5]  Tatsuya Kawahara,et al.  Evaluation of Real-time Deep Learning Turn-taking Models for Multiple Dialogue Scenarios , 2018, ICMI.

[6]  Tong Zhang,et al.  Deep Pyramid Convolutional Neural Networks for Text Categorization , 2017, ACL.

[7]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.

[8]  Yann Dauphin,et al.  A Convolutional Encoder Model for Neural Machine Translation , 2016, ACL.

[9]  Matthew W. Crocker,et al.  Enhancing Referential Success by Tracking Hearer Gaze , 2012, SIGDIAL Conference.

[10]  Juha Häkkinen,et al.  Robust end-of-utterance detection for real-time speech recognition applications , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[11]  Ryo Ishii,et al.  Neural Dialogue Context Online End-of-Turn Detection , 2018, SIGDIAL Conference.

[12]  Cícero Nogueira dos Santos,et al.  Learning Character-level Representations for Part-of-Speech Tagging , 2014, ICML.

[13]  S. Duncan,et al.  On the structure of speaker–auditor interaction during speaking turns , 1974, Language in Society.

[14]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[15]  Maxine Eskénazi,et al.  A Finite-State Turn-Taking Model for Spoken Dialog Systems , 2009, NAACL.

[16]  Jürgen Schmidhuber,et al.  LSTM: A Search Space Odyssey , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[17]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[19]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[20]  Hiroshi Ishiguro,et al.  Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents , 2017, INTERSPEECH.

[21]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[22]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[24]  Gerald Penn,et al.  Convolutional Neural Networks for Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25]  Björn Granström,et al.  Multimodality in Language and Speech Systems , 2002 .

[26]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[27]  Yann Dauphin,et al.  Language Modeling with Gated Convolutional Networks , 2016, ICML.

[28]  Alexander H. Waibel,et al.  Natural human-robot interaction using speech, head pose and gestures , 2004, 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566).

[29]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).