Intonation and dialogue context as constraints for speech recognition

This paper describes a method for using intonation and dialogue context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Map Task corpus, a corpus of spontaneous, task-oriented dialogue speech. This corpus has been tagged according to a dialogue analysis scheme that assigns each utterance to one of 12 “move types,” such as “acknowledge,” “query-yes/no,” or “instruct.” Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the “correct” move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course, when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type, we use an intonation model combined with a dialogue model that places constraints on possible sequences of move types, together with the speech recognizer likelihoods under the different move-specific language models. In the full recognition system, the combination of automatic move-type recognition with the move-specific language models reduces the overall word error rate by a small but significant amount compared with a baseline system that takes neither intonation nor dialogue acts into account. Interestingly, the improvement in word error rate is restricted to “initiating” move types, where word recognition is important. In “response” move types, where the important information is conveyed by the move type itself (for example, positive versus negative response), there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialogue model in detail and describes the architecture in which they are combined.
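The move-type decision described above combines evidence from three knowledge sources: an intonation model, a dialogue model over move-type sequences, and the recognizer likelihoods under each move-specific language model. A minimal sketch of that combination, assuming the sources are treated as independent and combined by summing log-likelihoods (the function name, move-type labels, and all probability values below are illustrative placeholders, not the authors' actual models or numbers):

```python
import math

# Three example move types from the coding scheme (the full scheme has 12).
MOVE_TYPES = ["acknowledge", "query-yn", "instruct"]

def choose_move_type(intonation_ll, dialogue_ll, recognizer_ll):
    """Return the move type maximizing the summed log-likelihoods.

    Each argument maps a move type to a log-likelihood from one knowledge
    source; summing log-likelihoods corresponds to multiplying probabilities
    under an independence assumption.
    """
    return max(
        MOVE_TYPES,
        key=lambda m: intonation_ll[m] + dialogue_ll[m] + recognizer_ll[m],
    )

# Illustrative numbers only: the intonation model favors a yes/no query,
# the dialogue model (given the previous move) favors an acknowledgement,
# and the move-specific recognizer likelihoods settle the decision.
intonation = {"acknowledge": math.log(0.2), "query-yn": math.log(0.5), "instruct": math.log(0.3)}
dialogue   = {"acknowledge": math.log(0.6), "query-yn": math.log(0.3), "instruct": math.log(0.1)}
recognizer = {"acknowledge": math.log(0.5), "query-yn": math.log(0.3), "instruct": math.log(0.2)}

print(choose_move_type(intonation, dialogue, recognizer))  # → acknowledge
```

Once a move type is chosen this way, the word hypothesis produced under that move type's bigram language model is taken as the recognition output.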
