A Filter-Based Approach to Detect End-of-Utterances from Prosody in Dialog Systems

We propose an efficient method to detect end-of-utterances from prosodic information in conversational speech. Our method is based on the application of a large set of binary and ramp filters to the energy and fundamental frequency signals obtained from the speech signal. These filter responses, which can be computed very efficiently, are used as input to a learning algorithm that generates the final detector. Preliminary experiments using data obtained from conversations show that an accurate classifier can be trained efficiently and that good results can be obtained without requiring a speech recognition system.

[1]  Andreas Stolcke,et al.  A prosody-based approach to end-of-utterance detection that does not require speech recognition , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[2]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[3]  Andreas Stolcke,et al.  Is the speaker done yet? faster and more accurate end-of-utterance detection using prosody , 2002, INTERSPEECH.

[4]  Tasha Kaye Hollingsed Responsive behavior in tutorial spoken dialogues , 2006 .

[5]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[6]  Ian Witten,et al.  Data Mining , 2000 .

[7]  Nigel G. Ward,et al.  A case study in the identification of prosodic cues to turn-taking: back-channeling in Arabic , 2006, INTERSPEECH.

[8]  J. Ross Quinlan,et al.  Simplifying decision trees , 1987, Int. J. Hum. Comput. Stud..

[9]  Andreas Stolcke,et al.  Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing , 2004 .

[10]  Juha Häkkinen,et al.  Robust end-of-utterance detection for real-time speech recognition applications , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[11]  Franklin C. Crow,et al.  Summed-area tables for texture mapping , 1984, SIGGRAPH.

[12]  Xu Bo,et al.  AN IMPROVED ENTROPY-BASED ENDPOINT DETECTION ALGORITHM , 2002 .

[13]  Derek Hoiem,et al.  Computer vision for music identification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[14]  Olac Fuentes,et al.  Prosodic feature generation for back-channel prediction , 2006, INTERSPEECH.