Virtual Modality: a Framework for Testing and Building Multimodal Applications

This paper introduces a method that generates simulated multimodal input for testing multimodal system implementations and for building statistically motivated multimodal integration modules. The approach is motivated by the fact that true multimodal data, recorded from real usage scenarios, is difficult and costly to obtain in large amounts. On the other hand, thanks to operational speech-only dialogue system applications, a wide selection of speech/text data (in the form of transcriptions, recognizer outputs, parse results, etc.) is available. The underlying idea of the paper is to take these textual transcriptions and convert them into multimodal inputs that can assist multimodal system development. A conceptual framework is established that utilizes two input channels: the original speech channel and an additional channel called Virtual Modality. This additional channel provides a level of abstraction that represents non-speech user inputs (e.g., gestures or sketches). In the transcriptions of the speech modality, pre-defined semantic items (e.g., nominal location references) are identified, removed, and replaced with deictic references (e.g., here, there). The deleted semantic items are then placed into the Virtual Modality channel and, according to external parameters (such as a pre-defined user population with varying temporal behavior), are assigned temporal shifts relative to the onset of the corresponding deictic reference. The paper explains the procedure for creating Virtual Modality data, describes the speech-only database, and presents results based on a multimodal city information and navigation application.
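To make the data-generation procedure concrete, the following is a minimal Python sketch of the conversion step described above: a semantic item is located in a speech transcription, replaced by a deictic word, and emitted on the Virtual Modality channel with a temporal shift drawn from a simple Gaussian model standing in for the paper's user-population parameters. All names here (KNOWN_LOCATIONS, VirtualModalityEvent, to_multimodal, the shift parameters) are hypothetical and not from the original work.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical list of semantic items (nominal location references); a real
# system would obtain these from parse results rather than a hard-coded list.
KNOWN_LOCATIONS = ["Harvard Square", "Kendall Square", "the airport"]


@dataclass
class VirtualModalityEvent:
    item: str            # semantic item moved to the Virtual Modality channel
    deictic_time: float  # onset of the deictic word in the speech channel (s)
    event_time: float    # simulated gesture time = deictic onset + shift (s)


def to_multimodal(transcription: str,
                  deictic_onset: float,
                  shift_mean: float = 0.0,
                  shift_stddev: float = 0.5) -> Tuple[str, List[VirtualModalityEvent]]:
    """Replace nominal location references with a deictic word and emit
    simulated Virtual Modality events whose temporal shifts are sampled
    from a Gaussian (a stand-in for the external user-population parameters)."""
    events = []
    text = transcription
    for loc in KNOWN_LOCATIONS:
        if loc in text:
            text = text.replace(loc, "here", 1)
            shift = random.gauss(shift_mean, shift_stddev)
            events.append(VirtualModalityEvent(item=loc,
                                               deictic_time=deictic_onset,
                                               event_time=deictic_onset + shift))
    return text, events


# Example: a speech-only transcription becomes a deictic utterance plus a
# simulated gesture event on the Virtual Modality channel.
speech, gestures = to_multimodal("Show restaurants near Harvard Square",
                                 deictic_onset=1.8)
print(speech)    # "Show restaurants near here"
print(gestures)  # [VirtualModalityEvent(item='Harvard Square', ...)]
```

In this sketch the temporal shift may be negative or positive, reflecting that a user's gesture can precede or follow the spoken deictic reference; the mean and standard deviation would be set per simulated user group.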
