Incremental speech understanding in a multimodal web-based spoken dialogue system

In most spoken dialogue systems, the human speaker interacting with the system must wait until he or she has finished speaking to find out whether the speech was accurately understood. The verbal and nonverbal indicators of understanding that are typical of human-to-human interaction are generally absent from automated systems, resulting in an interaction that feels unnatural to the human user. However, as automatic speech recognition is incorporated into web-based and portable interfaces, a spoken dialogue system now has graphical as well as verbal means of communicating with the user. In this thesis, we present a multimodal web-based spoken dialogue system that incorporates incremental understanding of human speech. Through incremental understanding, the system can display its current understanding of specific concepts in real time, while the user is still in the process of uttering a sentence. In addition, the user can interact with the system through nonverbal input modalities such as typing and mouse clicking. We evaluate the results of a comparative user study in which one group used a configuration that provides incremental concept understanding, while another group used a configuration that lacks this feature. We found that the group receiving incremental updates had a higher task completion rate and greater overall user satisfaction.

Thesis Supervisor: James R. Glass
Title: Principal Research Scientist

Thesis Supervisor: Stephanie Seneff
Title: Principal Research Scientist
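The core idea of incremental concept understanding — updating the display whenever a concept's value first appears or changes in the evolving partial hypothesis — can be illustrated with a minimal sketch. This is only an illustration of the update logic, not the thesis's actual pipeline: the real system parses hypotheses with a natural language understanding component rather than the hypothetical regex patterns and flight-domain concept names (`source`, `destination`) assumed here.

```python
import re

# Hypothetical concept extractors for a flight domain. The actual system
# derives concepts from a full linguistic parse, not regexes; these patterns
# exist only to make the incremental-update loop concrete.
CONCEPT_PATTERNS = {
    "source": re.compile(r"\bfrom (\w+)\b"),
    "destination": re.compile(r"\bto (\w+)\b"),
}

def incremental_updates(partial_hypotheses):
    """Yield (concept, value) pairs as partial ASR hypotheses arrive.

    An update is emitted only when a concept's value first appears or
    changes, so a web UI can refresh its display in real time while the
    user is still speaking.
    """
    current = {}
    for hyp in partial_hypotheses:
        for concept, pattern in CONCEPT_PATTERNS.items():
            m = pattern.search(hyp)
            if m and current.get(concept) != m.group(1):
                current[concept] = m.group(1)
                yield concept, m.group(1)

# Simulated stream of growing partial hypotheses from the recognizer.
partials = [
    "i want a flight",
    "i want a flight from boston",
    "i want a flight from boston to denver",
]
for concept, value in incremental_updates(partials):
    print(f"display update: {concept} = {value}")
```

In a deployed web interface, each yielded update would be pushed to the browser (e.g., via AJAX polling, as in the era's rich web applications) so the user sees the system's evolving interpretation mid-utterance.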
