Learning When to Listen: Detecting System-Addressed Speech in Human-Human-Computer Dialog

New challenges arise for addressee detection when multiple people interact jointly with a spoken dialog system using unconstrained natural language. We study the problem of discriminating computer-directed from human-directed speech in a new corpus of human-human-computer (H-H-C) dialog, using lexical and prosodic features. The prosodic features use no word, context, or speaker information. With speech recognition output at 19% word error rate (WER), results improve from lexical features (equal error rate, EER = 23.1%) to prosodic features (EER = 12.6%) to a combined model (EER = 11.1%). Prosodic features also provide a 35% relative error reduction over a lexical model using true words (EER from 10.2% to 6.7%). Modeling energy contours with Gaussian mixture models (GMMs) yields a particularly good prosodic model. While lexical models perform well for commands, they confuse free-form system-directed speech with human-human speech. Prosodic models dramatically reduce these confusions, implying that users change speaking style as they shift addressees (computer versus human) within a session. Overall, the results provide strong support for combining simple acoustic-prosodic models with lexical models to detect speaking-style differences for this task.
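To make the GMM-based prosodic model and the EER metric concrete, here is a minimal Python sketch of that style of system, not the authors' actual pipeline: one GMM per addressee class is fit to frame-level energy features, an utterance is scored by its average per-frame log-likelihood ratio, and the EER is read off the score distribution. The helper names (fit_class_gmm, llr_score, equal_error_rate), feature dimensions, and mixture sizes are illustrative assumptions; only scikit-learn's GaussianMixture is a real API.

# Illustrative sketch of a GMM log-likelihood-ratio classifier over
# energy-contour features, in the spirit of the prosodic model described
# above. Feature extraction, shapes, and hyperparameters are assumptions,
# not the paper's configuration.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmm(frames, n_components=8, seed=0):
    # frames: array of shape [n_frames, feat_dim] pooled over one class
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    gmm.fit(frames)
    return gmm

def llr_score(utterance_frames, gmm_computer, gmm_human):
    # GaussianMixture.score() returns the average per-frame log-likelihood,
    # so this is the mean log-likelihood ratio; positive => computer-directed.
    return gmm_computer.score(utterance_frames) - gmm_human.score(utterance_frames)

def equal_error_rate(scores, labels):
    # labels: 1 = computer-directed, 0 = human-directed; higher score => 1.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    pos = labels.sum()
    neg = len(labels) - pos
    miss = (pos - np.cumsum(labels)) / pos   # miss rate as threshold sweeps down
    fa = np.cumsum(1 - labels) / neg         # false-alarm rate
    idx = np.argmin(np.abs(miss - fa))
    return (miss[idx] + fa[idx]) / 2.0

# Example usage with synthetic 3-dimensional frames (for illustration only):
#   gmm_c = fit_class_gmm(np.random.randn(5000, 3))
#   gmm_h = fit_class_gmm(np.random.randn(5000, 3) + 1.0)
#   s = llr_score(np.random.randn(200, 3), gmm_c, gmm_h)

A combined system like the one the abstract reports would then fuse this prosodic score with a lexical classifier's score (e.g., a weighted sum) before thresholding.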
