Using Speech in Visual Object Recognition

Automatic understanding of multi-modal input is a central topic in modern human-computer interfaces. Yet the basic question of how the interpretations provided by different modalities can be connected in a universal and robust manner remains open. The most intuitive input modalities, speech and vision, can only be correlated at a qualitative, content-based level of interpretation, and vague meanings and erroneous processing results make this extremely difficult to accomplish. A simple frame-based integration scheme that fills appropriate slots with new analysis results will fail when ambiguous or contradictory information appears. In this paper we propose a new probabilistic framework to overcome these drawbacks. The integration model is built from data collected on labeled test sets and in psycholinguistic experiments; thereby, the correspondence problem is solved in a robust and universal manner. In particular, we show that erroneous visual interpretations can be corrected by a joint analysis of visual and speech input data.
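
The core idea, weighing uncertain interpretations from both channels against each other rather than trusting either one alone, can be illustrated with a minimal Bayesian-fusion sketch. This is not the authors' implementation: the object set, the vision confusion model, and the word model below are invented for illustration, whereas the paper derives such statistics from labeled test sets and psycholinguistic experiments.

```python
# Toy sketch: correcting an erroneous visual label with speech evidence
# via Bayes' rule. All probabilities are hypothetical placeholders.

OBJECTS = ["bolt", "cube", "ring"]

# P(vision reports label v | true object o): hypothetical confusion model.
P_VISION = {
    "bolt": {"bolt": 0.7, "cube": 0.2, "ring": 0.1},
    "cube": {"bolt": 0.1, "cube": 0.8, "ring": 0.1},
    "ring": {"bolt": 0.2, "cube": 0.1, "ring": 0.7},
}

# P(speaker says word w | true object o): hypothetical naming model.
P_SPEECH = {
    "bolt": {"screw": 0.7, "block": 0.05, "ring": 0.05, "thing": 0.2},
    "cube": {"screw": 0.05, "block": 0.7, "ring": 0.05, "thing": 0.2},
    "ring": {"screw": 0.05, "block": 0.05, "ring": 0.7, "thing": 0.2},
}

def posterior(vision_label: str, word: str) -> dict[str, float]:
    """Posterior over the true object given both modalities, assuming a
    uniform prior and conditionally independent speech/vision channels."""
    scores = {
        o: P_VISION[o].get(vision_label, 0.0) * P_SPEECH[o].get(word, 0.0)
        for o in OBJECTS
    }
    z = sum(scores.values()) or 1.0
    return {o: s / z for o, s in scores.items()}

# Vision alone reports "cube", but the speaker said "ring": the joint
# posterior overturns the erroneous visual interpretation
# (ring ~0.58, cube ~0.33, bolt ~0.08).
print(posterior("cube", "ring"))
```

The conditional-independence assumption keeps the sketch to a single product of likelihoods; the paper's actual framework integrates the modalities through a Bayesian network, which admits richer dependency structure.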
