Learning words from natural audio-visual input

We present a model of early word learning which learns from natural audio and visual input. The model has been successfully implemented to learn words and their audio-visual grounding from camera and microphone input. Although simple in its current form, this model is a first step towards a more complete, fully-grounded model of language acquisition. Practical applications include adaptive human-machine interfaces for information browsing, assistive technologies, education, and entertainment.