Learning spoken words from multisensory input

Speech recognition and speech translation are traditionally addressed by processing acoustic signals alone, while nonlinguistic information is typically left unused. We present a new method that learns spoken words from naturally co-occurring multisensory information in dyadic (two-person) conversations. It has been observed that listeners have a strong tendency to look toward the objects referred to by the speaker during a conversation. In light of this, we propose using eye gaze to integrate acoustic and visual signals and to build audio-visual lexicons of objects. With such data gathered from conversations in different languages, the spoken names of objects can then be translated across languages based on their shared visual semantics. We have developed a multimodal learning system and report experimental results using speech and video in concert with eye movement records as training data.
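
To make the idea concrete, the sketch below (a minimal illustration, not the authors' implementation) shows one way gaze-grounded word learning and cross-language translation could be wired together: each spoken word token is associated with the object the listener fixated most during its utterance, yielding an audio-visual lexicon per language, and translation proceeds by matching words that are grounded in the same visual object. All class and function names here (WordToken, GazeFixation, build_lexicon, translate) are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WordToken:
    label: str        # transcribed word or phonetic cluster id (assumed input)
    start: float      # utterance start time in seconds
    end: float        # utterance end time in seconds


@dataclass
class GazeFixation:
    object_id: str    # visual object fixated by the listener, e.g. "cup"
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Temporal overlap in seconds between two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def build_lexicon(words, fixations):
    """Build an audio-visual lexicon: associate each word token with the
    object fixated most during its utterance, then keep the most frequent
    word-object pairing."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        best, best_ov = None, 0.0
        for f in fixations:
            ov = overlap(w.start, w.end, f.start, f.end)
            if ov > best_ov:
                best, best_ov = f.object_id, ov
        if best is not None:
            counts[w.label][best] += 1
    return {word: max(objs, key=objs.get) for word, objs in counts.items()}


def translate(word, lexicon_src, lexicon_tgt):
    """Translate a spoken object name by routing through the visual object
    identity shared between two monolingual audio-visual lexicons."""
    obj = lexicon_src.get(word)
    matches = [w for w, o in lexicon_tgt.items() if o == obj]
    return matches[0] if matches else None
```

In this toy setting, two lexicons built from English and (say) Mandarin conversation data about the same set of objects would let `translate("cup", lexicon_en, lexicon_zh)` return the Mandarin word grounded in the same visual object, without any parallel text.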