Paper information - Measuring Final Lengthening for Speaker-Change Prediction

We explore pre-silence syllabic lengthening as a cue for next-speakership prediction in spontaneous dialogue. When estimated using a transcription-mediated procedure, lengthening is shown to reduce error rates by 25% relative to majority class guessing. This indicates that lengthening should be exploited by dialogue systems. With that in mind, we evaluate an automatic measure of spectral envelope change, Mel-spectral flux (MSF), and show that its performance is at least as good as that of the transcription-mediated measure. Modeling MSF is likely to improve turn uptake in dialogue systems, and to benefit other applications needing an estimate of durational variability in speech.
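The abstract describes Mel-spectral flux only as an automatic measure of spectral envelope change and does not give its formula. As a rough illustration, the sketch below uses one common definition of spectral flux: the Euclidean distance between the log mel-band energies of successive frames. Every name and parameter here (`mel_spectral_flux`, the frame length, the hop size, the number of filters, the Hann window, the log floor) is an assumption made for this example and is not taken from the paper.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (one weight list per filter)."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(i * top / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        weights = []
        for f in bin_freqs:
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame, using a naive DFT for clarity."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2)
    return spectrum


def mel_spectral_flux(signal, sample_rate, frame_len=256, hop=128, n_filters=20):
    """Assumed definition: distance between successive log mel spectra."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    log_mel = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = power_spectrum(signal[start:start + frame_len])
        # A small floor keeps the logarithm finite for empty bands.
        log_mel.append([math.log(sum(w * p for w, p in zip(filt, power)) + 1e-10)
                        for filt in bank])
    # One flux value per pair of adjacent frames.
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(cur, prev)))
            for prev, cur in zip(log_mel, log_mel[1:])]
```

Under this definition, a stretch of speech whose spectrum changes slowly yields low flux values, which is the intuition for treating flux as a proxy for durational variability. The naive DFT is used only to keep the sketch free of dependencies; a practical implementation would use an FFT routine.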

Kornel Laskowski | Anna Hjalmarsson
