Learning When to Listen: Detecting System-Addressed Speech in Human-Human-Computer Dialog

New challenges arise for addressee detection when multiple people interact jointly with a spoken dialog system using unconstrained natural language. We study the problem of discriminating computer-directed from human-directed speech in a new corpus of human-human-computer (H-H-C) dialog, using lexical and prosodic features. The prosodic features use no word, context, or speaker information. With speech recognition output at 19% word error rate (WER), results improve from lexical features (equal error rate, EER = 23.1%) to prosodic features (EER = 12.6%) to a combined model (EER = 11.1%). Prosodic features also provide a 35% relative error reduction over a lexical model using true words (EER from 10.2% to 6.7%). Modeling energy contours with Gaussian mixture models (GMMs) yields a particularly good prosodic model. While lexical models perform well for commands, they confuse free-form system-directed speech with human-human speech. Prosodic models dramatically reduce these confusions, implying that users change speaking style as they shift addressees (computer versus human) within a session. Overall, the results provide strong support for combining simple acoustic-prosodic models with lexical models to detect speaking-style differences for this task.
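To make the GMM-based prosodic model and the EER metric concrete, here is a minimal Python sketch of that style of system, not the authors' actual pipeline: one GMM per addressee class is fit to frame-level energy features, an utterance is scored by its average per-frame log-likelihood ratio, and the EER is read off the score distribution. The helper names (fit_class_gmm, llr_score, equal_error_rate), feature dimensions, and mixture sizes are illustrative assumptions; only scikit-learn's GaussianMixture is a real API.

# Illustrative sketch of a GMM log-likelihood-ratio classifier over
# energy-contour features, in the spirit of the prosodic model described
# above. Feature extraction, shapes, and hyperparameters are assumptions,
# not the paper's configuration.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmm(frames, n_components=8, seed=0):
    # frames: array of shape [n_frames, feat_dim] pooled over one class
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    gmm.fit(frames)
    return gmm

def llr_score(utterance_frames, gmm_computer, gmm_human):
    # GaussianMixture.score() returns the average per-frame log-likelihood,
    # so this is the mean log-likelihood ratio; positive => computer-directed.
    return gmm_computer.score(utterance_frames) - gmm_human.score(utterance_frames)

def equal_error_rate(scores, labels):
    # labels: 1 = computer-directed, 0 = human-directed; higher score => 1.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    pos = labels.sum()
    neg = len(labels) - pos
    miss = (pos - np.cumsum(labels)) / pos   # miss rate as threshold sweeps down
    fa = np.cumsum(1 - labels) / neg         # false-alarm rate
    idx = np.argmin(np.abs(miss - fa))
    return (miss[idx] + fa[idx]) / 2.0

# Example usage with synthetic 3-dimensional frames (for illustration only):
#   gmm_c = fit_class_gmm(np.random.randn(5000, 3))
#   gmm_h = fit_class_gmm(np.random.randn(5000, 3) + 1.0)
#   s = llr_score(np.random.randn(200, 3), gmm_c, gmm_h)

A combined system like the one the abstract reports would then fuse this prosodic score with a lexical classifier's score (e.g., a weighted sum) before thresholding.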
