Speech Repairs, Intonational Boundaries and Discourse Markers: Modeling Speakers' Utterances in Spoken Dialog

In this thesis, we present a statistical language model for resolving speech repairs, intonational boundaries and discourse markers. Rather than finding only the best word interpretation for an acoustic signal, we redefine the speech recognition problem so that it also identifies the POS tags, discourse markers, speech repairs and intonational phrase endings (a major cue in determining utterance units). Adding these extra elements to the speech recognition problem actually allows it to better predict the words involved, since we can use the predictions of boundary tones, discourse markers and speech repairs to better account for what word will occur next. Furthermore, we can take advantage of acoustic information, such as silence, which tends to co-occur with speech repairs and intonational phrase endings and which current language models can only regard as noise in the acoustic signal. The output of this language model is a much fuller account of the speaker's turn, with a part-of-speech tag assigned to each word, intonational phrase endings and discourse markers identified, and speech repairs detected and corrected. In fact, identifying the intonational phrase endings and discourse markers and resolving the speech repairs allows the speech recognizer to model the speaker's utterances, rather than simply the words involved, and thus it can return a more meaningful analysis of the speaker's turn for later processing.
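
The redefinition described above can be sketched in the usual noisy-channel notation. This is an illustrative formulation only, not taken from the abstract: the symbols are assumptions introduced here, with A the acoustic signal, W the word sequence, P the POS tags, D the discourse marker labels, R the speech repair labels and T the intonational phrase (boundary tone) labels.

```latex
% Conventional speech recognition: recover only the words.
\hat{W} = \arg\max_{W} \Pr(W \mid A)
        = \arg\max_{W} \Pr(A \mid W)\,\Pr(W)

% Redefined problem (assumed notation): recover the words together with
% POS tags P, discourse markers D, repairs R and boundary tones T.
(\hat{W}, \hat{P}, \hat{D}, \hat{R}, \hat{T})
  = \arg\max_{W,P,D,R,T} \Pr(W, P, D, R, T \mid A)
  = \arg\max_{W,P,D,R,T} \Pr(A \mid W, P, D, R, T)\,\Pr(W, P, D, R, T)

% Chain-rule expansion of the language model term over word positions i,
% where the subscript 1,i-1 denotes everything before position i.
\Pr(W, P, D, R, T)
  = \prod_{i=1}^{N} \Pr\bigl(W_i, P_i, D_i, R_i, T_i \mid
      W_{1,i-1}, P_{1,i-1}, D_{1,i-1}, R_{1,i-1}, T_{1,i-1}\bigr)
```

Under this reading, the extra labels enter the history that conditions each word, which is how their prediction can help predict the next word, and the acoustic term is where cues such as silence could be brought to bear on the repair and boundary labels.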