Learning sparse representations of three-dimensional objects

Each object in our environment can cause considerably different patterns of excitation in our retinae, depending on the viewpoint from which it is observed. Despite this, we are able to perceive that the changing signals are produced by the same object. It is a function of our brain to provide this constant recognition from such inconstant input signals by establishing an internal representation of the object. The nature of such a viewpoint-invariant representation, how it can be acquired, and its application in a perception task are the concern of this work. We describe the generation of view-based, sparse representations of real-world objects and apply them in a pose estimation task.

1 What can we Learn from the Brain?

Many behavioral studies with primates support the model of a view-based description of three-dimensional objects by our visual system. If a set of unfamiliar object views is presented to humans, their response times and error rates during recognition increase with increasing angular distance between the learned (i.e., stored) view and the unfamiliar view [4]. This angle effect declines if intermediate views are experienced and stored [11]. The performance is not linearly dependent on the shortest angular distance in three dimensions to the best-recognized view; rather, it correlates with an "image-plane feature-by-feature deformation distance" between the test view and the best-recognized view [2]. Thus, measuring image-plane similarity to a few feature patterns seems to be an appropriate model for human three-dimensional object recognition. Experiments with monkeys show that familiarization with a "limited number" of views of a novel object can provide viewpoint-independent recognition [6].

Numerous physiological studies also give evidence for view-based processing in the brain during object recognition. Recordings of single neurons in the inferior temporal cortex (IT) of monkeys, an area known to be concerned with object recognition, yield results that resemble those of the behavioral studies. Populations of IT neurons have been found which respond selectively to only some views of an object, and whose response declines as the object is rotated away from the preferred view [7].

In summary, object representations in the form of single but connected views seem to be sufficient for a wide variety of situations and perception tasks. In sections 2 and 3 we introduce our approach to learning an object representation which takes these results about primate brain functions into account. We automatically generate sparse representations of real-world objects which satisfy the following conditions:

(a1) They are constituted from two-dimensional views.
(a2) They are sparse, i.e., they consist of as few views as possible.
(a3) They are capable of supporting perception tasks.

The last condition is verified in section 4, where we apply our representations to estimate the poses of objects.
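To make the view-based scheme concrete, the following minimal Python sketch estimates an object's pose as the pose label of the stored view that is most similar to the test view in the image plane. This is our illustration under stated assumptions, not the paper's implementation: the choice of normalized cross-correlation as the image-plane similarity measure and all function names are assumptions.

    # Illustrative sketch (not the paper's implementation): pose estimation
    # by image-plane similarity to a sparse set of stored views. The
    # similarity measure and all names here are assumptions.
    import numpy as np

    def image_plane_similarity(view_a, view_b):
        """Normalized cross-correlation between two equally sized grayscale views."""
        a = view_a - view_a.mean()
        b = view_b - view_b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float((a * b).sum() / denom) if denom > 0 else 0.0

    def estimate_pose(test_view, stored_views, stored_poses):
        """Return the pose label of the stored view most similar to the test view.

        stored_views -- the sparse representation: a few 2-D view arrays
        stored_poses -- one pose parameter per stored view, e.g. a rotation angle
        """
        scores = [image_plane_similarity(test_view, v) for v in stored_views]
        best = int(np.argmax(scores))
        return stored_poses[best], scores[best]

In such a scheme a test view is simply assigned the pose of its nearest stored view, so the expected angular error shrinks as intermediate views are added to the representation, mirroring the decline of the angle effect reported in [11].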