A Proposal for Processing and Fusioning Multiple Information Sources in Multimodal Dialog Systems

Multimodal dialog systems can be defined as computer systems that process two or more user input modes and combine them with multimedia system output. This paper focuses on the multimodal input, providing a proposal to process and fuse the multiple input modalities in the dialog manager of the system, so that a single combined input is used to select the next system action. We describe an application of our technique to build multimodal systems that process the user's spoken utterances, tactile and keyboard inputs, and information related to the context of the interaction. In our proposal, this contextual information is divided into external context and the user's internal context, the latter represented in our contribution by the detection of the user's intention during the dialog and of their emotional state.
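
To make the fusion step concrete, the sketch below illustrates one common way to realize it: decision-level (late) fusion, in which each modality's recognizer emits a semantic frame with a confidence score, the frames are merged slot by slot, and contextual features (detected intention and emotional state) are attached to the single combined input passed to the dialog manager. This is a minimal illustrative sketch, not the paper's implementation; all names (ModalityInput, fuse_inputs, the slot labels) are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class ModalityInput:
        modality: str       # e.g. "speech", "touch", "keyboard"
        slots: dict         # semantic frame produced by the modality's recognizer
        confidence: float   # recognizer confidence score in [0, 1]

    def fuse_inputs(inputs, context):
        # Decision-level (late) fusion: merge the per-modality semantic
        # frames into one combined frame for the dialog manager. When two
        # modalities propose the same slot, the value from the
        # higher-confidence source wins.
        fused = {}
        for inp in sorted(inputs, key=lambda i: i.confidence, reverse=True):
            for slot, value in inp.slots.items():
                fused.setdefault(slot, value)
        # Attach contextual features (detected intention and emotional
        # state) so the dialog manager can condition its next action on them.
        fused["intention"] = context.get("intention")
        fused["emotion"] = context.get("emotion")
        return fused

    # Usage: a spoken request plus a map tap that corrects the city slot.
    speech = ModalityInput("speech", {"action": "book", "city": "Madrid"}, 0.72)
    touch = ModalityInput("touch", {"city": "Granada"}, 0.95)
    print(fuse_inputs([speech, touch],
                      {"intention": "plan_trip", "emotion": "neutral"}))
    # -> {'city': 'Granada', 'action': 'book',
    #     'intention': 'plan_trip', 'emotion': 'neutral'}

The confidence-ordered merge means a high-confidence tactile selection can override a lower-confidence speech hypothesis for the same slot, which is the usual motivation for fusing modalities before action selection rather than handling each input stream separately.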

[1]  Minh Tue Vo,et al.  Building an application framework for speech and pen input integration in multimodal learning interfaces , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[2]  Michael Johnston,et al.  Unification-based Multimodal Parsing , 1998, ACL.

[3]  S. Khorram,et al.  Data fusion using artificial neural networks: a case study on multitemporal change analysis , 1999 .

[4]  Wolfgang Minker,et al.  Design considerations for knowledge source representations of a stochastically-based natural language understanding component , 1999, Speech Commun..

[5]  Sharon L. Oviatt,et al.  From members to teams to committee-a robust approach to gestural and multimodal recognition , 2002, IEEE Trans. Neural Networks.

[6]  David Traum,et al.  The Information State Approach to Dialogue Management , 2003 .

[7]  Grace Chung,et al.  Developing a Flexible Spoken Dialog System Using Simulation , 2004, ACL.

[8]  Nicu Sebe,et al.  Multimodal Human Computer Interaction: A Survey , 2005, ICCV-HCI.

[9]  Masahiro Araki,et al.  Spoken, Multilingual and Multimodal Dialogue Systems: Development and Assessment , 2005 .

[10]  Wolfgang Wahlster,et al.  SmartKom: Foundations of Multimodal Dialogue Systems , 2006, SmartKom.

[11]  Sebastian Möller,et al.  Memo: towards automatic usability evaluation of spoken dialogue services by user error simulations , 2006, INTERSPEECH.

[12]  Steve J. Young,et al.  A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies , 2006, The Knowledge Engineering Review.

[13]  Steve J. Young,et al.  Partially observable Markov decision processes for spoken dialog systems , 2007, Comput. Speech Lang..

[14]  Ramón López-Cózar,et al.  Influence of contextual information in emotion annotation for spoken dialogue systems , 2008, Speech Commun..

[15]  David Griol,et al.  A statistical approach to spoken dialog systems design and evaluation , 2008, Speech Commun..

[16]  Bruno Dumas Frameworks, description languages and fusion engines for multimodal interactive systems , 2010 .

[17]  Ramón López-Cózar,et al.  Multimodal Dialogue for Ambient Intelligence and Smart Environments , 2010, Handbook of Ambient Intelligence and Smart Environments.

[18]  Nuno J. Mamede,et al.  Ambient Intelligence Interaction via Dialogue Systems , 2010 .

[19]  Feng Gao,et al.  Spoken language understanding using weakly supervised learning , 2010, Comput. Speech Lang..

[20]  Fang Chen,et al.  Chapter 12 – Multimodal Input , 2010 .

[21]  Björn W. Schuller,et al.  Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge , 2011, Speech Commun..

[22]  James R. Lewis The Voice in the Machine: Building Computers That Understand Speech , 2012, Int. J. Hum. Comput. Interact..

[23]  David Griol,et al.  Bringing context-aware access to the web through spoken interaction , 2013, Applied Intelligence.

[24]  David Griol,et al.  A statistical simulation technique to develop and evaluate conversational agents , 2013, AI Commun..

[25]  John Mourjopoulos,et al.  Automatic speech recognition performance in different room acoustic environments with and without dereverberation preprocessing , 2013, Comput. Speech Lang..

[26]  Matthew Turk,et al.  Multimodal interaction: A review , 2014, Pattern Recognit. Lett..