Integrated analysis of speech and images as a probabilistic decoding process

Speech understanding and vision are the two most important modalities in human-human communication. Emulating them on a computer, however, faces fundamental difficulties: noisy data, vague meanings, previously unseen objects or unheard words, occlusions, spontaneous speech effects, and context dependence. As a result, the interpretation processes on both channels are highly error-prone. This paper presents a new perspective on the problem of relating speech and image interpretations, treating it as a probabilistic decoding process. It is shown that such an integration scheme is robust to partial or erroneous interpretations. Furthermore, it is shown that implicit error correction strategies can be formulated in this probabilistic framework, leading to improved scene interpretation.
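To make the idea of decoding concrete, the following is a minimal sketch and not the model used in the paper. It assumes a single intended object, a uniform prior, and speech and vision evidence that are conditionally independent given that object. The object names, the likelihood values, and the function `decode` are hypothetical and chosen only for illustration.

```python
def decode(prior, speech_lik, vision_lik):
    """Posterior over candidate objects, assuming the speech and vision
    evidence are conditionally independent given the intended object."""
    scores = {obj: prior[obj] * speech_lik[obj] * vision_lik[obj] for obj in prior}
    total = sum(scores.values())
    return {obj: score / total for obj, score in scores.items()}


# Hypothetical scene with three candidate objects and a uniform prior.
prior = {"bolt": 1 / 3, "cube": 1 / 3, "bar": 1 / 3}

# Assumed P(recognized word "bolt" | intended object).
speech_lik = {"bolt": 0.7, "cube": 0.1, "bar": 0.2}

# Assumed P(vision label "cube" | true object): the vision channel has
# probably mislabeled the object here.
vision_lik = {"bolt": 0.3, "cube": 0.6, "bar": 0.1}

posterior = decode(prior, speech_lik, vision_lik)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

In this toy setting the combined evidence still favors the object named in the speech channel even though the vision label disagrees, which is the kind of implicit error correction the abstract refers to.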
