Towards Incremental End-of-Utterance Detection in Dialogue Systems

We define the task of incremental, or 0-lag, utterance segmentation, that is, the task of segmenting an ongoing speech recognition stream into utterance units, and present first results. We combine a hidden event language model, features from an incremental parser, and acoustic/prosodic features to train classifiers on real-world conversational data from the Switchboard corpus. The best classifiers reach an F-score of around 56%, improving over the baseline and related work.
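
To make the reported metric concrete, the following is a minimal sketch of an F-score over per-word boundary decisions. The abstract does not specify how the evaluation was carried out, so this sketch rests on assumptions: each word in the stream receives one binary label (True if an utterance ends after it), and the function name `boundary_f_score` and the toy label sequences are hypothetical, not taken from the paper.

```python
from typing import Sequence


def boundary_f_score(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F-score over per-word boundary labels (True = utterance ends after this word).

    Assumption: one decision per word; this is an illustrative metric,
    not necessarily the evaluation protocol used in the paper.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)

    if true_pos == 0:
        # No correctly detected boundary: precision or recall is zero.
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy stream of eight words with three true utterance ends.
gold = [False, False, True, False, True, False, False, True]
pred = [False, True, True, False, False, False, False, True]
score = boundary_f_score(gold, pred)
print(f"F-score: {score:.3f}")
```

In a 0-lag setting, each predicted label would have to be produced using only the input available up to that word, which is what distinguishes the task from offline sentence segmentation.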
