Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor

Supportive backchannel (BC) cues can make human-computer interaction more social. BCs provide feedback from the listener to the speaker, indicating that the listener is still paying attention. BCs can be expressed in different ways depending on the modality of the interaction, for example as gestures or acoustic cues. In this work, we consider only acoustic cues. We propose an approach for detecting BC opportunities based on acoustic input features such as power and pitch. While other works in the field rely on hand-written rule sets or specialized features, we use artificial neural networks, which can derive higher-order features from the input features on their own. In our setup, we first used a fully connected feed-forward network to establish an updated baseline for comparison with our previously proposed setup. We then extended this setup with Long Short-Term Memory (LSTM) networks, which have been shown to outperform feed-forward setups on various tasks. Our best system achieved an F1-Score of 0.37 using power and pitch features. Adding linguistic information in the form of word2vec embeddings increased the score to 0.39.
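For readers less familiar with the metric, the F1-Score reported above is the harmonic mean of precision and recall. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name and the example counts are hypothetical, and the abstract does not state how predicted backchannels are matched against reference backchannels (for example, which timing tolerance is allowed).

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall from raw counts.

    The counts are assumed to come from some matching of predicted BC
    events against reference BC events; that matching is not shown here.
    """
    predicted = true_positives + false_positives   # all BCs the system emitted
    actual = true_positives + false_negatives      # all BCs in the reference
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0                                 # avoid division by zero
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the call.
    print(f1_score(true_positives=1, false_positives=1, false_negatives=1))
```

Because the harmonic mean is dominated by the smaller of the two values, a predictor that fires very often (high recall, low precision) or very rarely (high precision, low recall) still receives a low score.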
