Integration of Vision and Speech Understanding Using Bayesian Networks

The interaction of image and speech processing is a crucial property of multimedia systems. Classical systems that draw inferences on purely qualitative high-level descriptions lose much information when confronted with erroneous, vague, or incomplete data. We propose a new architecture that integrates various levels of processing by using multiple representations of the visually observed scene. The representations are vertically connected by Bayesian networks in order to find the most plausible interpretation of the scene. The interpretation of a spoken utterance naming an object in the visually observed scene is modeled as another partial representation of the scene. Within this concept, the key problem is identifying the verbally specified object instances in the visually observed scene. To solve it, a Bayesian network is generated dynamically from the spoken utterance and the visual scene representation.
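The identification step can be illustrated with a toy example. The following Python sketch is not the architecture described above; it assumes a deliberately simple network in which a hidden variable selects the intended object, each scene object carries an uncertain type and color estimate from vision, and each spoken word depends on the true feature value through a confusion table. All object names, feature names, and probabilities are hypothetical values chosen for illustration.

```python
from typing import Dict, List

Dist = Dict[str, float]  # maps a feature value to its probability


def word_likelihood(belief: Dist, confusion: Dict[str, Dist], word: str) -> float:
    """P(word | visual belief) = sum over true values v of P(v) * P(word | v)."""
    return sum(p * confusion.get(v, {}).get(word, 0.0) for v, p in belief.items())


def identify(scene: List[Dict[str, Dist]],
             utterance: Dict[str, str],
             confusion: Dict[str, Dict[str, Dist]]) -> List[float]:
    """Posterior over which scene object the utterance refers to.

    The network is built per utterance: only the features that were
    actually spoken contribute a likelihood term.
    """
    prior = 1.0 / len(scene)  # assumption: every object is equally likely a priori
    scores = []
    for obj in scene:
        score = prior
        for feature, word in utterance.items():
            score *= word_likelihood(obj[feature], confusion[feature], word)
        scores.append(score)
    total = sum(scores)
    if total == 0.0:
        return [prior] * len(scene)  # utterance is incompatible with every object
    return [s / total for s in scores]


# Hypothetical vision output: per-object beliefs over type and color.
scene = [
    {"type": {"bolt": 0.9, "cube": 0.1}, "color": {"red": 0.8, "orange": 0.2}},
    {"type": {"cube": 0.7, "bolt": 0.3}, "color": {"red": 0.3, "blue": 0.7}},
]

# Hypothetical P(spoken word | true value), covering vague naming and recognition errors.
confusion = {
    "type": {"bolt": {"bolt": 0.9, "cube": 0.1}, "cube": {"cube": 0.9, "bolt": 0.1}},
    "color": {
        "red": {"red": 0.8, "orange": 0.2},
        "orange": {"orange": 0.7, "red": 0.3},
        "blue": {"blue": 1.0},
    },
}

utterance = {"type": "bolt", "color": "red"}  # e.g. "the red bolt"
posterior = identify(scene, utterance, confusion)
best = max(range(len(scene)), key=lambda i: posterior[i])
print(posterior, best)
```

Because both the visual estimates and the spoken words are treated as uncertain evidence, an object whose color was classified as possibly orange can still be selected by the word "red". A system that reasons only over hard qualitative labels cannot make this kind of graded decision.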
