Discourse structure for multi-speaker spontaneous spoken dialogs: incorporating heuristics into stochastic RTNs

In real spoken language applications, where speakers interact spontaneously, there is considerable apparent unpredictability that makes recognition difficult. Multi-speaker spontaneous dialog, in which two speakers interact verbally to cooperatively solve a shared problem, is more varied than human-computer interaction. Spontaneous speech is not well structured, exhibiting mid-utterance corrections and restarts. Discourse contains digressions, clarifications, corrections and topic changes, and multi-speaker discourse is more varied still, with initiative effects and with speakers interacting, planning and responding. This makes it extremely difficult to develop grammars and language models with adequate coverage and reliable stochastic parameters. Perplexity increases and recognition degrades considerably relative to human-database dialog. Despite this, multi-speaker dialogs are structured and predictable when the discourse is appropriately modelled. We have developed heuristics to model spontaneous speech and multi-speaker dialogs. These heuristics have been shown to predict discourse phenomena adequately and accurately, as evaluated on a corpus of more than 10,000 utterances. The heuristics for computing discourse structure, and for deriving constraints from it, are generally rule based. We have used these rules to develop a set of stochastic recursive transition networks (RTNs) that capture both the rules and the corpus probabilities. The resulting language model can be used to dynamically generate stochastic utterance predictions, or can be incorporated into any recognition/understanding system that maintains a single prior state.
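
To make the architecture concrete, the sketch below shows one plausible encoding of a stochastic RTN for discourse-level prediction. The network names, dialog-act labels and arc probabilities are illustrative assumptions, not the networks or corpus statistics described here; it is a minimal sketch of how rule structure and arc probabilities can be combined to yield a distribution over the next utterance type from a single prior configuration.

from collections import defaultdict

# Each network maps a state to a list of arcs (label, is_call, probability, next_state).
# 'is_call' marks a push into a sub-network; otherwise the label is a terminal
# dialog act emitted when the arc is traversed. A next_state of None marks a
# final state of that network. Labels and probabilities are hypothetical.
NETWORKS = {
    "DIALOG": {
        "d0": [("OPEN", False, 1.0, "d1")],
        "d1": [("EXCHANGE", True, 0.8, "d1"),
               ("CLOSE", False, 0.2, None)],
    },
    "EXCHANGE": {
        "e0": [("REQUEST", False, 1.0, "e1")],
        "e1": [("CLARIFY", False, 0.3, "e0"),    # clarification sub-dialog
               ("CONFIRM", False, 0.7, None)],
    },
}
START = {"DIALOG": "d0", "EXCHANGE": "e0"}


def next_act_distribution(config):
    """Return P(next dialog act | configuration).

    A configuration is (network, state, stack), where 'stack' holds the
    (network, return_state) pairs of pending sub-network calls.
    """
    network, state, stack = config
    dist = defaultdict(float)
    _expand(network, state, stack, 1.0, dist)
    return dict(dist)


def _expand(network, state, stack, mass, dist):
    if state is None:                        # finished this network: pop the stack
        if stack:
            (ret_net, ret_state), rest = stack[-1], stack[:-1]
            _expand(ret_net, ret_state, rest, mass, dist)
        return
    for label, is_call, prob, nxt in NETWORKS[network][state]:
        if is_call:                          # push into the sub-network's start state
            _expand(label, START[label], stack + ((network, nxt),), mass * prob, dist)
        else:                                # terminal arc: a predicted dialog act
            dist[label] += mass * prob


if __name__ == "__main__":
    print(next_act_distribution(("DIALOG", "d0", ())))   # {'OPEN': 1.0}
    print(next_act_distribution(("DIALOG", "d1", ())))   # {'REQUEST': 0.8, 'CLOSE': 0.2}

Because the configuration (network, state and call stack) is a single value, this kind of predictor fits the setting described above, where the recognition/understanding system maintains only a single prior state between utterances.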