Does the human brain represent objects for recognition by storing a series of two-dimensional snapshots, or are the object models, in some sense, three-dimensional analogs of the objects they represent? One way to address this question is to explore the ability of the human visual system to generalize recognition from familiar to novel views of three-dimensional objects. Three recently proposed theories of object recognition predict different patterns of generalization to novel views: viewpoint normalization or alignment of 3D models [Ullman, S. (1989) Cognition, 32, 193-254], linear combination of 2D views [Ullman, S. & Basri, R. (1990)], and view approximation [Poggio, T. & Edelman, S. (1990) Nature, 343, 263-266]. We have exploited the conflicting predictions to test the three theories directly, in a psychophysical experiment involving computer-generated 3D objects. Our results suggest that the human visual system is better described as recognizing these objects by 2D view interpolation than by alignment or other methods that rely on object-centered 3D models.

How does the human visual system represent objects for recognition? The experiments we describe address this question by testing the ability of human subjects (and of computer models instantiating particular theories of recognition) to generalize from familiar to unfamiliar views of novel objects. Since different theories predict different patterns of generalization according to the experimental conditions, this approach yields concrete evidence in favor of some of the theories and contradicts others.

Theories that rely on 3D object-centered representations

The first class of theories we consider [1, 4, 5] represents objects by 3D models, encoded in a viewpoint-independent fashion. One such approach, recognition by alignment [1], compares the input image with the projection of a stored model after the two are brought into register. The transformation necessary to achieve this registration is computed by matching a small number of features in the image with the corresponding features in the model. The aligning transformation is computed separately for each of the models stored in the system. Recognition is declared for the model that fits the input most closely after the two are aligned, provided the residual dissimilarity between them is small enough. The decision criterion for recognition can be stated in the following simplified form:

$\| P T X^{(3D)} - X^{(2D)} \| < \epsilon \qquad (1)$

where $T$ is the aligning transformation, $P$ is a 3D-to-2D projection operator, and the norm $\| \cdot \|$ measures the dissimilarity between the projection of the transformed 3D model $X^{(3D)}$ and the input image $X^{(2D)}$. The recognition decision is then made by comparing the measured dissimilarity against the threshold $\epsilon$.

One may make a further distinction between full alignment, which uses 3D models and attempts to compensate for 3D transformations of objects (such as rotation in depth), and the alignment of pictorial descriptions, which uses multiple views rather than a single object-centered representation. Specifically ([1], p. 228), the multiple-view version of alignment involves a representation that is "view-dependent, since a number of different models of the same object from different viewing positions will be used," but at the same time "view-insensitive, since the differences between views are partially compensated by the alignment process."
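As a concrete illustration of the decision criterion in Eq. (1), the following minimal sketch computes the alignment residual for a single model/image pair. This is not the implementation of any of the cited systems: the orthographic projection, the routine `estimate_alignment` that recovers the transformation $T$ from a few matched anchor features, and the threshold `eps` are all assumptions introduced here for illustration.

```python
import numpy as np

def project(X3d):
    """Orthographic 3D -> 2D projection P: keep x and y, drop z."""
    return X3d[:, :2]

def align_and_match(X3d, X2d, estimate_alignment, eps=0.1):
    """Alignment criterion of Eq. (1): accept if ||P T X^(3D) - X^(2D)|| < eps.

    X3d: (n, 3) array of 3D model points.
    X2d: (n, 2) array of 2D image points, in correspondence with X3d.
    estimate_alignment: hypothetical routine that recovers the aligning
        transformation T (here, a 3x3 matrix) from a few matched features.
    """
    T = estimate_alignment(X3d, X2d)                   # computed per stored model
    residual = np.linalg.norm(project(X3d @ T.T) - X2d)
    return residual < eps                              # small residual => recognized
```

In a full system this residual would be computed separately for every stored model, with recognition declared for the best-fitting one.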
Consequently, view-independent performance (e.g., a low error rate for novel views) can be considered the central distinguishing feature of both versions of this theory. Visual systems that rely on alignment and other 3D approaches can in principle achieve near-perfect recognition performance, provided that (i) the 3D models of the input objects are available, and (ii) the information needed to access the correct model is present in the image.

We note that a similar behavior is predicted by those recognition theories that represent objects by 3D structural relationships between generic volumetric primitives. Theories belonging to this class (e.g., [6, 7]) tend to focus on basic-level classification of objects rather than on the recognition of specific object instances,¹ and will not be given further consideration in this paper.

¹ Numerous studies in cognitive science (see [8] for a review) reveal that in the hierarchical structure of object categories there exists a certain level, called the basic level, which is the most salient according to a variety of criteria (such as ease and preference of access). Taking as an example the hierarchy "quadruped, mammal, cat, Siamese," the basic level is that of "cat." Objects whose recognition implies more detailed distinctions than those required for basic-level categorization are said to belong to a subordinate level.

Theories that rely on 2D viewer-centered representations

Two recently proposed approaches to recognition dispense with the need for storing 3D models. The first of these, recognition by linear combination of views [2], is built on the mathematical observation that, under orthographic projection, the 2D coordinates of an object point can be represented by a linear combination of the coordinates of the corresponding points in a small number of fixed 2D views of the same object. The required number of views depends on the allowed 3D transformations of the objects and on the representation of an individual view. A polyhedral object that can undergo a general linear transformation requires three views if separate linear bases are used to represent the x and the y coordinates of a new view; two views suffice if a mixed x,y basis is used [2, 9]. The recognition criterion under one possible version of the linear combination approach [10] can be formulated schematically as

$\left\| \sum_i \alpha_i X_i^{(2D)} - X^{(2D)} \right\| < \epsilon \qquad (2)$

where the stored views $X_i^{(2D)}$ comprise the linear vector basis that represents an object model (i.e., spans the space of the object's views), $X^{(2D)}$ is the input image, and the $\alpha_i$ are the coefficients estimated for the given model/image pair (see the sketch below). A recognition system that is perfectly linear and relies exclusively on this approach should achieve uniformly high performance on those views that fall within the space spanned by the stored set of model views, and should perform poorly on views that belong to an orthogonal space.
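The following sketch instantiates Eq. (2) under stated assumptions: it uses a mixed x,y basis, so the stored views form the columns of a single basis matrix, and the coefficients $\alpha_i$ are recovered by ordinary least squares. The function name and the threshold are illustrative, not taken from [2] or [10].

```python
import numpy as np

def linear_combination_match(stored_views, X2d, eps=0.1):
    """Linear-combination criterion of Eq. (2):
    accept if min_alpha || sum_i alpha_i X_i^(2D) - X^(2D) || < eps.

    stored_views: list of (n, 2) arrays, the basis views X_i^(2D).
    X2d: (n, 2) array, the input image (points in correspondence).
    Each view is flattened into one column of the basis matrix B
    (a mixed x,y basis; separate x and y bases would be handled analogously).
    """
    B = np.stack([v.ravel() for v in stored_views], axis=1)  # (2n, k) basis
    x = X2d.ravel()
    alpha, *_ = np.linalg.lstsq(B, x, rcond=None)            # best-fit coefficients
    residual = np.linalg.norm(B @ alpha - x)
    return residual < eps
```

A view lying inside the space spanned by the stored views yields a near-zero residual; a view with a large component orthogonal to that space does not, which is precisely the behavior the text attributes to a perfectly linear system.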
Another approach that represents objects by sets of 2D views is view approximation by regularization networks [3, 11], which includes as a special case approximation by radial basis functions (RBFs) [12, 13]. In this approach, generalization from familiar to novel views is regarded as a problem of approximating a smooth hypersurface in the space of all possible views, with the "height" of the surface known only at a sparse set of points corresponding to the familiar views. The approximation can be performed by a two-stage network (see [9] for details). In the first stage, intermediate responses are formed by a collection of nonlinear "receptive fields" (shaped, e.g., as multidimensional Gaussians), centered at the familiar views. The output of the second stage is a linear combination of the intermediate receptive-field responses. If the regularization network is trained to output the value 1 for various views of a given object, the decision criterion for recognition can be stated as

$\left| \sum_k c_k \, G\left( \| X^{(2D)} - X_k^{(2D)} \| \right) - 1 \right| < \epsilon \qquad (3)$

where $X^{(2D)}$ is the input image, the $X_k^{(2D)}$ are the familiar or prototypical views stored in the system, the $c_k$ are the linear coefficients, and the function $G(\cdot)$ represents the shape of the receptive field (a sketch of this criterion appears at the end of this section). A recognition system based on this method is expected to perform well when the novel view is close to the stored ones (that is, when most of the features of the input image fall close to their counterparts in at least some of the stored views; cf. [14]). The performance should become progressively worse on views that are far from the familiar ones.

Methods

To distinguish between the theories outlined above, we have developed an experimental paradigm based on a two-alternative forced-choice (2AFC) task. Our experiments consist of two phases: training and testing. In the training phase subjects are shown a novel object (see Figure 1), defined as the target, usually as a motion sequence of 2D views that leads to an impression of solid shape through the kinetic depth effect. In the testing phase the subjects are presented with single static views of either the target or a distractor (one of a relatively large set of similar objects). Target test views were situated either on the equator (on the 0°-75° or on the 75°-360° portion of the great circle, called the inter and extra conditions) or on the meridian passing through one of the training views (the ortho condition) (see Figure 2). The subject's task was to press a "yes" button if the displayed object was the current target and a "no" button otherwise, and to do so as quickly and as accurately as possible. These instructions usually resulted in mean response times around 1 sec and in mean miss rates around 30%. The fast response times indicate that the subjects did not apply conscious problem-solving techniques or reason explicitly about the stimuli. In all our experiments the subjects received no feedback as to the correctness of their responses.

The main features of our experimental approach are as follows. First, we can control precisely the subject's prior exposure to the targets, by employing novel computer-generated three-dimensional objects, similar to those shown in Figure 1. Second, we can generate an unlimited number of novel objects with controlled complexity and surface appearance. Third, because the stimuli are produced by computer graphics, we can conduct identical experiments with human subjects and with computational models.

Results

The experimental setup satisfied both requirements of the alignment theory for perfect recognition: the subjects, all of whom reported perfect perception of 3D structure from motion during training, had the opportunity to form 3D models of the stimuli, and al
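For completeness, here is the sketch of the view-approximation criterion of Eq. (3) promised above. The Gaussian receptive-field width `sigma` and the threshold `eps` are illustrative assumptions, and the coefficients $c_k$ are taken as given, as if the network had already been trained to output 1 on views of the target.

```python
import numpy as np

def rbf_recognition(stored_views, c, X2d, sigma=1.0, eps=0.1):
    """View-approximation criterion of Eq. (3):
    accept if | sum_k c_k G(||X^(2D) - X_k^(2D)||) - 1 | < eps.

    stored_views: list of (n, 2) arrays, the familiar views X_k^(2D).
    c: array of trained linear coefficients c_k (the second network stage).
    X2d: (n, 2) array, the input view.
    """
    dists = np.array([np.linalg.norm(X2d - Xk) for Xk in stored_views])
    G = np.exp(-dists**2 / (2.0 * sigma**2))   # Gaussian receptive fields (stage 1)
    output = float(np.dot(c, G))               # linear combination (stage 2)
    return abs(output - 1.0) < eps             # output near 1 => target recognized
```

Because each Gaussian falls off with distance from its familiar view, the network's output decays for views far from all stored ones, reproducing the graded generalization pattern this theory predicts.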
References

[1] Wayne D. Gray, et al. Basic objects in natural categories. Cognitive Psychology, 1976.
[2] Irving Biederman, et al. Human image understanding: Recent research and a theory. Comput. Vis. Graph. Image Process., 1985.
[3] D. W. Thompson, et al. Three-dimensional model matching from an unconstrained viewpoint. Proceedings, 1987 IEEE International Conference on Robotics and Automation, 1987.
[4] I. Rock, et al. A case of viewer-centered object perception. Cognitive Psychology, 1987.
[5] David S. Broomhead, et al. Multivariable Functional Interpolation and Adaptive Networks. Complex Syst., 1988.
[6] S. Ullman. Aligning pictorial descriptions: An approach to object recognition. Cognition, 1989.
[7] John Moody, et al. Fast Learning in Networks of Locally-Tuned Processing Units. Neural Computation, 1989.
[8] S. Edelman, et al. Stimulus Familiarity Determines Recognition Strategy for Novel 3D Objects. 1989.
[9] I. Rock, et al. Can we imagine how objects look from other viewpoints? Cognitive Psychology, 1989.
[10] S. Edelman, et al. Generalization of object recognition in human vision across stimulus transformations and deformations. 1990.
[11] T. Poggio, et al. Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks. Science, 1990.
[12] T. Poggio, et al. A network that learns to recognize three-dimensional objects. Nature, 1990.
[13] Ronen Basri, et al. Recognition by Linear Combinations of Models. IEEE Trans. Pattern Anal. Mach. Intell., 1991.
[14] Shimon Edelman, et al. Bringing the Grandmother back into the Picture: A Memory-Based View of Object Recognition. Int. J. Pattern Recognit. Artif. Intell., 1990.
[15] Daphna Weinshall, et al. A Model of the Acquisition of Object Representations in Human 3D Visual Recognition. 1993.