Exploiting loudness dynamics in stochastic models of turn-taking

Stochastic turn-taking models have traditionally been implemented as N-grams, which condition predictions on recent binary-valued speech/non-speech contours. The current work re-implements this function using feed-forward neural networks, capable of accepting binary- as well as continuous-valued features; performance is shown to asymptotically approach that of the N-gram baseline as model complexity increases. The conditioning context is then extended to leverage loudness contours. Experiments indicate that the additional sensitivity to loudness considerably decreases average cross entropy rates on unseen data, by 0.03 bits per framing interval of 100 ms. This reduction is shown to make loudness-sensitive conversants capable of better predictions, with attention memory requirements at least 5 times smaller and responsiveness latency at least 10 times shorter than the loudness-insensitive baseline.
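To make the quantities in the abstract concrete, the following is a minimal sketch of an N-gram model over a binary speech/non-speech contour, together with the average cross entropy rate in bits per frame. It is an illustration under simplifying assumptions, not the authors' implementation: it models a single speaker's contour, and the names `train_ngram`, `cross_entropy_rate`, and the add-`alpha` smoothing are hypothetical choices made here. Each frame is assumed to cover one 100 ms interval.

```python
import math
from collections import defaultdict


def train_ngram(frames, order, alpha=1.0):
    """Fit an N-gram model on a binary contour (1 = speech, 0 = non-speech).

    Assumption: a single speaker's contour and add-alpha smoothing;
    both are illustrative choices, not taken from the paper.
    """
    ctx = order - 1  # number of past frames used as conditioning context
    counts = defaultdict(lambda: [0, 0])
    for t in range(ctx, len(frames)):
        history = tuple(frames[t - ctx:t])
        counts[history][frames[t]] += 1

    def prob(history, value):
        # Smoothed probability of `value` given the recent history.
        c = counts.get(tuple(history), [0, 0])
        return (c[value] + alpha) / (c[0] + c[1] + 2 * alpha)

    return prob


def cross_entropy_rate(prob, frames, order):
    """Average negative log2-probability per frame, in bits per frame."""
    ctx = order - 1
    total_bits = 0.0
    n = 0
    for t in range(ctx, len(frames)):
        total_bits -= math.log2(prob(frames[t - ctx:t], frames[t]))
        n += 1
    return total_bits / n


# Toy contours: regular bursts of speech and silence (hypothetical data).
train = [1, 1, 1, 1, 0, 0, 0, 0] * 50
held_out = [1, 1, 1, 1, 0, 0, 0, 0] * 5
model = train_ngram(train, order=3)
rate = cross_entropy_rate(model, held_out, order=3)
```

An untrained model assigns probability 0.5 to every frame and therefore scores exactly 1 bit per frame; any model that captures structure in the contour scores lower on held-out data. The 0.03 bits per 100 ms interval reported above is a reduction in this kind of rate.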
