Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?

Identifying whether an utterance is a statement, question, greeting, and so forth is integral to effective automatic understanding of natural dialog. Little is known, however, about how such dialog acts (DAs) can be automatically classified in truly natural conversation. This study asks whether current approaches, which use mainly word information, could be improved by adding prosodic information. The study is based on more than 1000 conversations from the Switchboard corpus. DAs were hand-annotated, and prosodic features (duration, pause, F0, energy, and speaking rate) were automatically extracted for each DA. In training, decision trees based on these features were inferred; trees were then applied to unseen test data to evaluate performance. Performance was evaluated for prosody models alone, and after combining the prosody models with word information—either from true words or from the output of an automatic speech recognizer. For an overall classification task, as well as three subtasks, prosody made significant contributions to classification. Feature-specific analyses further revealed that although canonical features (such as F0 for questions) were important, less obvious features could compensate if canonical features were removed. Finally, in each task, integrating the prosodic model with a DA-specific statistical language model improved performance over that of the language model alone, especially for the case of recognized words. Results suggest that DAs are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
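The integration of a prosodic model with a DA-specific language model can be illustrated with a minimal sketch. It is not the study's implementation: it assumes that words and prosodic features are conditionally independent given the DA, so that the combined posterior is proportional to the prosodic posterior times the word likelihood. The function name `combine_da_scores`, the `lm_weight` parameter, the probability floor, and all numbers are hypothetical.

```python
import math


def combine_da_scores(prosody_posterior, lm_log_likelihood, lm_weight=1.0):
    """Combine prosodic and language-model evidence for each dialog act (DA).

    prosody_posterior: dict DA -> P(DA | prosodic features), e.g. from a decision tree.
    lm_log_likelihood: dict DA -> log P(words | DA), from a DA-specific language model.
    lm_weight: hypothetical scaling factor for the language-model score.

    Assumes words and prosody are independent given the DA (an assumption of
    this sketch), so P(DA | words, prosody) is proportional to
    P(DA | prosody) * P(words | DA) ** lm_weight.
    """
    floor = 1e-12  # avoid log(0) for DAs the prosodic model rules out
    log_scores = {
        da: math.log(max(prosody_posterior[da], floor))
        + lm_weight * lm_log_likelihood[da]
        for da in prosody_posterior
    }
    # Normalize in the log domain (log-sum-exp) for numerical stability.
    peak = max(log_scores.values())
    log_total = peak + math.log(sum(math.exp(s - peak) for s in log_scores.values()))
    return {da: math.exp(s - log_total) for da, s in log_scores.items()}


# Illustrative, made-up numbers: prosody favors "question", words favor "statement".
example = combine_da_scores(
    prosody_posterior={"statement": 0.3, "question": 0.7},
    lm_log_likelihood={"statement": math.log(0.02), "question": math.log(0.01)},
)
best_da = max(example, key=example.get)
```

When the language-model evidence is weak or unreliable, as with recognized rather than true words, the prosodic term carries relatively more of the decision in this kind of combination.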
