Toward Widely-Available and Usable Multimodal Conversational Interfaces

Multimodal conversational interfaces, which allow humans to interact with a computer using a combination of spoken natural language and a graphical interface, offer the potential to transform the manner in which humans communicate with computers. While researchers have developed myriad such interfaces, none have made the transition out of the laboratory and into the hands of a significant number of users. This thesis makes progress toward overcoming two intertwined barriers preventing more widespread adoption: availability and usability. Toward addressing the problem of availability, this thesis introduces a new platform for building multimodal interfaces that makes it easy to deploy them to users via the World Wide Web. One consequence of this work is City Browser, the first multimodal conversational interface made publicly available to anyone with a web browser and a microphone. City Browser serves as a proof-of-concept that significant amounts of usage data can be collected in this way, allowing a glimpse of how users interact with such interfaces outside of a laboratory environment. City Browser, in turn, has served as the primary platform for deploying and evaluating three new strategies aimed at improving usability. The most pressing usability challenge for conversational interfaces is their limited ability to accurately transcribe and understand spoken natural language. The three strategies developed in this thesis (context-sensitive language modeling, response confidence scoring, and user behavior shaping) each attack the problem from a different angle, but they are linked in that each critically integrates information from the conversational context.
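To make the linking role of conversational context concrete, the sketch below illustrates one way dialogue context can bias a class-based n-gram language model: proper nouns currently plausible given the context (for example, street names in the map region shown to the user) are swapped into a word class at recognition time. This is a minimal illustration under stated assumptions, not the thesis implementation; the class names, probabilities, and the ContextSensitiveLM helper are all hypothetical.

```python
# Minimal sketch (not the thesis implementation) of context-sensitive,
# class-based language modeling: the members of a word class such as
# <STREET> are replaced with the proper nouns suggested by the current
# dialogue/GUI context, so that context-relevant names become more
# likely during recognition.

import math


class ContextSensitiveLM:
    def __init__(self, class_ngram_probs, default_members):
        # class_ngram_probs: P(next | prev) over words and class tokens
        #   such as "<STREET>" or "<CITY>", estimated offline.
        # default_members: fallback membership lists for each class.
        self.ngram = class_ngram_probs
        self.members = dict(default_members)

    def update_context(self, class_name, visible_names):
        # Called when the context changes (e.g. the map pans to a new
        # city): swap in the proper nouns that are currently plausible.
        self.members[class_name] = list(visible_names)

    def log_prob(self, prev, word):
        # Word observed directly in the class n-gram.
        direct = self.ngram.get((prev, word))
        if direct is not None:
            return math.log(direct)
        # Otherwise generate it through a class: P(class | prev) times a
        # uniform membership probability P(word | class).
        best = float("-inf")
        for cls, names in self.members.items():
            if word in names and (prev, cls) in self.ngram:
                p = self.ngram[(prev, cls)] / len(names)
                best = max(best, math.log(p))
        return best


# Usage: once the GUI shows Cambridge, Cambridge street names become
# likely expansions of the <STREET> class.
lm = ContextSensitiveLM(
    class_ngram_probs={("on", "<STREET>"): 0.3, ("on", "the"): 0.2},
    default_members={"<STREET>": ["Main Street"]},
)
lm.update_context("<STREET>", ["Massachusetts Avenue", "Broadway", "Main Street"])
print(lm.log_prob("on", "Broadway"))  # log(0.3 / 3) = log(0.1)
```

The same idea of conditioning on context carries over to the other two strategies in the abstract: a confidence scorer can use features of the system's planned response, and behavior-shaping prompts can steer users toward phrasings the recognizer handles well in the current dialogue state.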
