Measuring Final Lengthening for Speaker-Change Prediction

We explore pre-silence syllabic lengthening as a cue for next-speakership prediction in spontaneous dialogue. When estimated using a transcription-mediated procedure, lengthening reduces error rates by 25% relative to majority-class guessing, which suggests that dialogue systems should exploit it. With that in mind, we evaluate an automatic measure of spectral envelope change, Mel-spectral flux (MSF), and show that it performs at least as well as the transcription-mediated measure. Modeling MSF is likely to improve turn uptake in dialogue systems and to benefit other applications that need an estimate of durational variability in speech.
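
To make the idea of a spectral-envelope-change measure concrete, the following is a minimal sketch of one common way to compute spectral flux over Mel-band energies. The abstract does not define MSF, so the formulation here is an assumption: flux is taken as the Euclidean distance between successive log Mel-band energy vectors. The function name `mel_spectral_flux`, the `floor` parameter, and the toy input frames are all hypothetical and chosen only for illustration. Under this assumed definition, a stretch of speech whose envelope changes slowly, as might be expected during lengthening, would yield low flux values.

```python
import math

def mel_spectral_flux(frames, floor=1e-10):
    """Frame-to-frame flux for a sequence of Mel-band energy vectors.

    Assumed definition (not taken from the paper): the Euclidean distance
    between the log energies of consecutive frames.
    """
    # Log-compress the energies; the floor avoids log(0) on silent bands.
    log_frames = [[math.log(max(e, floor)) for e in frame] for frame in frames]
    flux = []
    for prev, cur in zip(log_frames, log_frames[1:]):
        flux.append(math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur))))
    return flux

# Toy Mel-band energies (hypothetical values, three bands per frame).
steady = [[1.0, 2.0, 4.0]] * 3                      # envelope does not change
changing = [[1.0, 2.0, 4.0], [4.0, 2.0, 1.0], [1.0, 2.0, 4.0]]

steady_flux = mel_spectral_flux(steady)      # one value per frame transition
changing_flux = mel_spectral_flux(changing)  # larger values where the envelope moves
```

In practice the input frames would come from a Mel filterbank applied to short-time spectra; that front end is omitted here to keep the sketch self-contained.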
