Using Neural Networks for Data-Driven Backchannel Prediction: A Survey on Input Features and Training Techniques

Supporting backchannel cues can help make human-computer interaction more social. Such cues can be delivered through different channels, such as vision, speech, or gestures. In this work, we focus on predicting acoustic backchannels, that is, those delivered through speech. Previously, this prediction has been accomplished with rule-based approaches. Like every rule-based implementation, however, these approaches depend on a fixed set of handwritten rules that must be changed whenever the mechanism is adjusted or different data is used. In this paper, we aim to overcome these limitations by making use of recent advances in machine learning. We show that backchannel predictions can be generated with a neural-network-based approach. Such a method has the advantage of depending only on the training data, without the need for handwritten rules.
