Automatic detection of discourse structure for speech recognition and understanding

We describe a new approach to the statistical modeling and detection of discourse structure in natural conversational speech. Our model is based on 42 dialog acts (DAs), such as question, answer, backchannel, agreement, disagreement, and apology. We labeled 1155 conversations from the Switchboard (SWBD) database (Godfrey et al., 1992) of human-to-human telephone conversations with these 42 types and trained a dialog act detector on three distinct knowledge sources: word sequences that characterize a dialog act, prosodic features that characterize a dialog act, and a statistical discourse grammar. Our combined detector, although still in preliminary stages, already achieves a 65% dialog act detection rate from acoustic waveforms and 72% accuracy from word transcripts. Using this detector to switch among the 42 dialog-act-specific trigram language models (LMs) also gave an encouraging, but not statistically significant, reduction in SWBD word error.
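One way to picture how the three knowledge sources might be combined is to treat the dialog acts as hidden states of a Markov chain: the discourse grammar supplies transition probabilities between dialog acts, and the word and prosodic evidence supply per-utterance likelihoods. The sketch below is illustrative only. The abstract does not specify the combination method, so the bigram discourse grammar, the independence assumption between words and prosody, the `prosody_weight` parameter, the function names, and all probabilities are assumptions made for this example, using a toy set of three dialog acts.

```python
import math

def combine_evidence(word_loglik, prosody_loglik, prosody_weight=1.0):
    """Sum per-DA log-likelihoods from words and prosody.

    Assumes (for illustration) that the two knowledge sources are
    conditionally independent given the dialog act.
    """
    return {da: word_loglik[da] + prosody_weight * prosody_loglik[da]
            for da in word_loglik}

def viterbi_dialog_acts(utterance_loglik, transition_logprob, initial_logprob):
    """Return the most probable dialog act sequence.

    utterance_loglik:   list of {da: log P(evidence | da)}, one per utterance
    transition_logprob: {(prev_da, da): log P(da | prev_da)}, a hypothetical
                        bigram discourse grammar
    initial_logprob:    {da: log P(da)} for the first utterance
    """
    das = list(initial_logprob)
    # best[t][da]: best log score of any DA sequence ending in `da` at utterance t
    best = [{da: initial_logprob[da] + utterance_loglik[0][da] for da in das}]
    back = []
    for obs in utterance_loglik[1:]:
        scores, pointers = {}, {}
        for da in das:
            prev = max(das, key=lambda p: best[-1][p] + transition_logprob[(p, da)])
            scores[da] = best[-1][prev] + transition_logprob[(prev, da)] + obs[da]
            pointers[da] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(das, key=lambda d: best[-1][d])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

# Toy numbers, invented for this example (not taken from the paper).
initial = logs({"question": 0.5, "answer": 0.2, "backchannel": 0.3})
transitions = logs({
    ("question", "question"): 0.1, ("question", "answer"): 0.8,
    ("question", "backchannel"): 0.1,
    ("answer", "question"): 0.3, ("answer", "answer"): 0.3,
    ("answer", "backchannel"): 0.4,
    ("backchannel", "question"): 0.4, ("backchannel", "answer"): 0.4,
    ("backchannel", "backchannel"): 0.2,
})
utterances = [
    combine_evidence(
        logs({"question": 0.6, "answer": 0.3, "backchannel": 0.1}),   # words
        logs({"question": 0.7, "answer": 0.2, "backchannel": 0.1})),  # prosody
    combine_evidence(
        logs({"question": 0.4, "answer": 0.4, "backchannel": 0.2}),
        logs({"question": 0.3, "answer": 0.4, "backchannel": 0.3})),
]

decoded = viterbi_dialog_acts(utterances, transitions, initial)
print(decoded)
```

In this toy setup the words of the second utterance are equally consistent with a question and an answer; the discourse grammar's preference for an answer after a question, together with the prosodic evidence, resolves the ambiguity. In a recognizer, the decoded label for each utterance could then select which dialog-act-specific language model to apply.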

[1] Mark G. Core, et al. Coding Dialogs with the DAMSL Annotation Scheme, 1997.

[2] Andreas Stolcke, et al. Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?, 1998, Language and Speech.

[3] Alex Waibel, et al. Prosody and speech recognition, 1988.

[4] Norbert Reithinger, et al. Predicting dialogue acts for a speech-to-speech translation system, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[5] Elmar Nöth, et al. Dialog act classification with the help of prosody, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[6] Ian H. Witten, et al. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression, 1991, IEEE Trans. Inf. Theory.

[7] Monika Woszczyna, et al. Inferring linguistic structure in spoken language, 1994, ICSLP.

[8] Norbert Reithinger, et al. Predicting dialogue acts for a speech-to-speech translation system, 1996.

[9] Masaaki Nagata, et al. First steps towards statistical modeling of dialogue to predict the speech act type of the next utterance, 1994, Speech Communication.

[10] Simon King, et al. Using prosodic information to constrain language models for spoken dialogue, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[11] Kenji Kita, et al. Automatic acquisition of probabilistic dialogue models, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[12] A. Stolcke, et al. Dialog act modelling for conversational speech, 1998.

[13] Simon King, et al. Using intonation to constrain language models in speech recognition, 1997, EUROSPEECH.

[14] Roger K. Moore, et al. A theory of word frequencies and its application to dialogue move recognition, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[15] Alexander H. Waibel, et al. Towards better language models for spontaneous speech, 1994, ICSLP.

[16] Jean Carletta, et al. Assessing Agreement on Classification Tasks: The Kappa Statistic, 1996, CL.

[17] John J. Godfrey, et al. SWITCHBOARD: telephone speech corpus for research and development, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18] Dan Jurafsky, et al. Dialog Act Modeling for Conversational Speech, 1998.

[19] P. Taylor, et al. Intonation and dialogue context as constraints for speech recognition, 1998.

[20] Victor Zue, et al. Empirical evaluation of human performance and agreement in parsing discourse constituents in spoken dialogue, 1995, EUROSPEECH.

[21] Eric Fosler-Lussier, et al. Speech recognition using on-line estimation of speaking rate, 1997, EUROSPEECH.

[22] Hitoshi Iida, et al. Dialogue interpretation model and its application to next utterance prediction for spoken language processing, 1991, EUROSPEECH.

[23] Sean Connolly, et al. Improvements in switchboard recognition and topic identification, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[24] Mark Terry, et al. Automated query identification in English dialogue, 1994, ICSLP.

[25] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.