A trainable spoken language understanding system for visual object selection

We present a trainable, visually grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions. The system is embodied in a table-top-mounted active vision platform. During training, a set of objects is placed in front of the vision system. Using a laser pointer, the system points to objects in random sequence, prompting a human teacher to provide spoken descriptions of the selected objects. The descriptions are transcribed and used to automatically acquire a visually grounded vocabulary and grammar. Once trained, a person can interact with the system by verbally describing objects placed in front of it. The system recognizes and robustly parses the speech and points, in real time, to the object that best fits the visual semantics of the spoken description.
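To make the selection step concrete, the sketch below illustrates one simple way such grounded selection could work: each acquired word is modeled as a distribution over a visual feature, and the system picks the object whose measured features best fit the words in the parsed description. This is a minimal illustration under assumed representations, not the paper's implementation; the feature names, lexicon entries, and Gaussian parameters are all hypothetical.

```python
# Illustrative sketch of "point to the object that best fits the description".
# Assumption: each grounded word is a Gaussian over one visual feature.
import math
from dataclasses import dataclass


@dataclass
class WordModel:
    feature: str   # visual feature this word is assumed to be grounded in
    mean: float
    var: float

    def log_likelihood(self, features: dict[str, float]) -> float:
        x = features[self.feature]
        return -0.5 * (math.log(2 * math.pi * self.var)
                       + (x - self.mean) ** 2 / self.var)


# Hypothetical grounded lexicon acquired from show-and-tell sessions.
LEXICON = {
    "red":   WordModel(feature="hue", mean=0.02, var=0.01),
    "green": WordModel(feature="hue", mean=0.33, var=0.01),
    "large": WordModel(feature="area", mean=0.80, var=0.05),
    "small": WordModel(feature="area", mean=0.15, var=0.05),
}


def select_object(description: list[str],
                  objects: list[dict[str, float]]) -> int:
    """Return the index of the object whose features best fit the description,
    scoring each object by the summed log-likelihood of the grounded words."""
    def score(obj: dict[str, float]) -> float:
        return sum(LEXICON[w].log_likelihood(obj)
                   for w in description if w in LEXICON)
    return max(range(len(objects)), key=lambda i: score(objects[i]))


if __name__ == "__main__":
    scene = [
        {"hue": 0.03, "area": 0.78},  # large red object
        {"hue": 0.31, "area": 0.12},  # small green object
    ]
    print(select_object(["large", "red"], scene))  # -> 0
```

A richer system would also weight words by how reliably they predict visual features and combine them through the acquired grammar, but the argmax-over-objects structure of the selection step is the same.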
