A Controlling Strategy for an Active Vision System Based on Auditory and Visual Cues

It is still an open question how preliminary visual reflexes can be structured by the auditory and visual modalities in order to recognize objects. We therefore propose a new controlling strategy for an active vision system that learns to focus on relevant multimodal aspects of the environment. The method is bootstrapped by a bottom-up visual saliency process that extracts important visual points. In this paper, we present our first results and focus on the unsupervised generation of training data for multimodal object recognition. The performance is compared against a human-evaluated database.