Using Out-of-Domain Data for Lexical Addressee Detection in Human-Human-Computer Dialog

Addressee detection (AD) is an important problem for dialog systems in human-human-computer scenarios (contexts involving multiple people and a system) because system-directed speech must be distinguished from human-directed speech. Recent work on AD (Shriberg et al., 2012) showed good results using prosodic and lexical features trained on in-domain data. In-domain data, however, is expensive to collect for each new domain. In this study we focus on lexical models and investigate how well out-of-domain data (data from outside the task domain, or from single-user scenarios) can fill in for matched in-domain data. We find that human-addressed speech can be modeled using out-of-domain conversational speech transcripts, and that computer-directed speech can be modeled using single-user data: the resulting AD system outperforms a system trained only on matched in-domain data. Further gains (up to a 4% reduction in equal error rate) are obtained when in-domain and out-of-domain models are interpolated. Finally, we examine which parts of an utterance are most useful. We find that the first 1.5 seconds of an utterance contain most of the lexical information for AD, and analyze which lexical items convey this information. Overall, we conclude that the human-human-computer (H-H-C) scenario can be approximated by combining data from human-computer (H-C) and human-human (H-H) scenarios alone.
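The sketch below illustrates the general modeling idea described in the abstract: score an utterance by the log-likelihood ratio between a computer-directed lexical model and a human-directed lexical model, where the computer-directed model linearly interpolates an in-domain model with an out-of-domain one. It is a minimal sketch, not the authors' system; the unigram models, add-one smoothing, the interpolation weight of 0.6, and the toy corpora below are all assumptions made for illustration, standing in for the paper's actual models and data.

```python
"""Minimal sketch of LM-based addressee detection with interpolation.

Assumptions (not from the paper): unigram LMs, add-one smoothing,
a fixed interpolation weight, and toy stand-in corpora.
"""
from collections import Counter
import math


class UnigramLM:
    """Add-one-smoothed unigram language model over a shared vocabulary."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = vocab

    def word_prob(self, word):
        # Add-one smoothing so unseen words still get nonzero probability.
        return (self.counts[word] + 1) / (self.total + len(self.vocab))


class MixtureLM:
    """Word-level linear interpolation: lam * P_in(w) + (1 - lam) * P_out(w)."""

    def __init__(self, lm_in, lm_out, lam=0.6):
        self.lm_in, self.lm_out, self.lam = lm_in, lm_out, lam

    def word_prob(self, word):
        return (self.lam * self.lm_in.word_prob(word)
                + (1 - self.lam) * self.lm_out.word_prob(word))


def addressee_score(utterance, computer_lm, human_lm):
    """Log-likelihood ratio: positive scores favor computer-directed speech."""
    return sum(
        math.log(computer_lm.word_prob(w)) - math.log(human_lm.word_prob(w))
        for w in utterance.lower().split()
    )


if __name__ == "__main__":
    # Toy stand-ins: in-domain computer-addressed utterances, out-of-domain
    # single-user commands, and out-of-domain conversational transcripts.
    in_domain_hc = ["computer play some music", "computer show the weather"]
    single_user = ["play the next song", "show me movies nearby", "set a timer"]
    conversational = ["yeah i was thinking we could go later",
                      "you know what i mean"]

    vocab = {w for s in in_domain_hc + single_user + conversational
             for w in s.split()}

    # Computer-directed model: in-domain interpolated with single-user data.
    lm_c = MixtureLM(UnigramLM(in_domain_hc, vocab),
                     UnigramLM(single_user, vocab), lam=0.6)
    # Human-directed model: out-of-domain conversational speech only.
    lm_h = UnigramLM(conversational, vocab)

    for utt in ["show me the next movie", "yeah you know we could"]:
        print(f"{utt!r} -> {addressee_score(utt, lm_c, lm_h):+.2f}")
```

In a real system, the decision threshold on this score would be tuned on held-out data (for example at the equal-error-rate operating point reported in the paper), and richer n-gram or class-based models with proper smoothing would replace the toy unigram models used here.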

[1] Robert L. Mercer et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[2] Regina Barzilay et al. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment, 2003, NAACL.

[3] Gökhan Tür et al. Bootstrapping Domain Detection Using Query Click Logs for New Domains, 2011, INTERSPEECH.

[4] Gökhan Tür et al. Research Challenges and Opportunities in Mobile Applications [DSP Education], 2011, IEEE Signal Processing Magazine.

[5] Dilek Z. Hakkani-Tür et al. Research Challenges and Opportunities in Mobile Applications, 2011.

[6] Ian H. Witten et al. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression, 1991, IEEE Trans. Inf. Theory.

[7] Jerome R. Bellegarda et al. Statistical language model adaptation: review and perspectives, 2004, Speech Commun.

[8] Jianfeng Gao et al. Exploring web scale language models for search query processing, 2010, WWW '10.

[9] Rieks op den Akker et al. A comparison of addressee detection methods for multiparty conversations, 2009.

[10] Maarten Sierhuis et al. Are You Talking to Me? Dialogue Systems Supporting Mixed Teams of Humans and Robots, 2006, AAAI Fall Symposium: Aurally Informed Performance.

[11] Tanja Schultz et al. Identifying the addressee in human-human-robot interactions based on head pose and speech, 2004, ICMI '04.

[12] Eric Horvitz et al. Multiparty Turn Taking in Situated Dialog: Study, Lessons, and Directions, 2011, SIGDIAL Conference.

[13] Tanja Schultz et al. Tue-SeA Real-Time Speech Command Detector for a Smart Control Room, 2011, INTERSPEECH.

[14] Andreas Stolcke et al. The ICSI Meeting Corpus, 2003, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03).

[15] Eric Horvitz et al. Continuous listening for unconstrained spoken dialog, 2000, INTERSPEECH.

[16] David Miller et al. The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text, 2004, LREC.

[17] Dilek Z. Hakkani-Tür et al. Learning When to Listen: Detecting System-Addressed Speech in Human-Human-Computer Dialog, 2012, INTERSPEECH.