Exploring the Relationship Between Context and Pose: A Case Study

While context has received little attention in the visual object classification literature, it nevertheless plays a vital role in the ability to identify objects in a scene. This paper seeks to improve the performance of object classifiers by incorporating contextual information. Our method uses probability maps to guide classifiers to image regions likely to contain the object in question, based on the object's past positions and the positions of surrounding objects. We contrast our method with a baseline unguided classifier and show that using probability maps as a preprocessing step significantly reduces the number of positions a classifier needs to evaluate. The structures presented here can be used with any classification algorithm that evaluates windowed image regions.

Introduction

The ability to automatically identify objects in an image is one of the fundamental topics driving computer vision research. While classifiers have demonstrated remarkable success in constrained domains such as faces ([Viola], [Schneiderman], [Gu], [Heisele], [Hjelmås], [Yang]), their performance still lags far behind human capabilities on arbitrary scenes. We believe one reason for this discrepancy is the lack of context in the traditional classification framework. Until recently, computer vision drew heavily from signal theory: in general, classifiers have relied on local, low-level image statistics without considering surrounding objects. Yet natural images (i.e., those without content constraints) contain a great deal of context which can be used to infer particular characteristics of objects in the scene. When multiple classifiers are run sequentially on the same image, they operate independently of one another. Intuitively, however, we know that certain objects tend to be present together (e.g., spoons and forks, or tables and chairs) and are linked by well-defined spatial relationships. In this paper we propose a framework for incorporating knowledge of previously detected objects into object classifiers.

Background

The general structure of an object classifier is as follows. In the learning phase, a collection of positive and negative training examples is used to generate a series of representative features. Features can consist of values taken directly from an image (such as a histogram of pixel intensities) or an abstraction (such as Haar-like wavelets [Papageorgiou]). During the test phase, a subwindow is selected from the query image and assigned a score in proportion to its correlation with the learned features. The subwindow is then shifted and the process repeated until the entire image has been evaluated. Image regions with a score above a given threshold are said to contain the object in question.
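
As a concrete illustration of this scanning loop, the sketch below slides a fixed-size window across an image in the linear pattern just described. It is a minimal sketch, not the paper's implementation: `classifier_score`, the window dimensions, the stride, and the threshold are all hypothetical stand-ins for a trained classifier and its tuning.

```python
def exhaustive_scan(image, classifier_score, win_h=64, win_w=64,
                    stride=4, threshold=0.5):
    """Slide a fixed-size window over the image, left-to-right and
    top-to-bottom, scoring every position against the learned features."""
    detections = []
    rows, cols = image.shape[:2]
    for y in range(0, rows - win_h + 1, stride):
        for x in range(0, cols - win_w + 1, stride):
            window = image[y:y + win_h, x:x + win_w]
            score = classifier_score(window)   # correlation with learned features
            if score > threshold:              # region said to contain the object
                detections.append((x, y, score))
    return detections
```

Every position is visited regardless of how unlikely it is to contain the object, which is precisely the behavior the two points below criticize.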

The problem with this sort of exhaustive search is twofold:

1) When the distribution of objects in an image is known to be non-uniform, searching low-probability regions is wasted effort. For applications requiring multiple classifiers, or applications running on low-throughput embedded hardware, classification speed rapidly becomes a computational bottleneck.

2) An exhaustive search creates opportunities for false detections in regions that are unlikely to contain the object of interest.

An exploration of the biological basis of object detection in [Itti] proposes that humans employ a hybrid bottom-up/top-down attention model. Input images are broken down into feature maps which encode low-level observations such as color, intensity, and gradient orientation. The feature maps are then combined into a single saliency map. In parallel, a database of acquired knowledge directs the focus towards areas that provide maximum information gain. Empirical results from [Davenport] show that humans identify objects more accurately when the objects appear in a semantically consistent setting. Functional MRI (fMRI) studies have identified cortical regions that regulate the processing of contextual relationships between pairs of objects [Bar].

Recent implementations inspired by physiological models have focused on the attention mechanism. [Gould] presents a system that combines wide-angle and telephoto cameras to simulate peripheral and foveal vision: a fixed wide-angle camera provides a low-resolution overview of the scene, while a PTZ-mounted telephoto lens hops between regions of interest. [Orabona] combines feature maps to form “salient regions” which detect proto-objects for intelligently guided object classifiers. [Torralba] describes a method to improve classification accuracy by modeling three terms: object appearance, object spatial distribution, and the likelihood of an object given a particular scene category. The first corresponds to the well-known low-level features, while the latter two comprise the higher-order knowledge; the three terms are multiplied to form a single probability function. [Hotz] takes a different approach to incorporating context: low- and high-level analyses are performed separately and communicate through a feedback loop. A hypothesis module generates predictions of object positions, which are passed to an AdaBoost-trained classifier; the classifier's results are in turn used to update the hypotheses.

Approach

The scanning method used by the majority of classifiers calls for sliding the subwindow through the image in a linear (left-to-right, top-to-bottom) pattern. As the two problems outlined above suggest, the order in which image regions are evaluated could instead be guided by the spatial distributions of the objects being searched for. Scanning the most likely regions first allows the classifier to simultaneously reduce the number of evaluations and the number of false positives. To accomplish this, we guide classifiers with the aid of probability maps: two-dimensional structures that encode the likelihood of finding an object at a given image coordinate. After computing the maps, an ordered list of image coordinates is generated, sorted by probability. The classifier evaluates a subwindow at the most probable coordinate first and works down the list, possibly terminating early if the probability drops below a predetermined threshold.

The motivation for incorporating object statistics as a preprocessing step in the form of maps is to leverage the existing power of classifiers. Since our method makes use of higher-order knowledge, it is natural to apply this contextual information as a layer above the low-level classifier rather than proposing a new, monolithic algorithm. A corollary benefit is that our method works independently of the underlying classification algorithm and thus can remain useful as more powerful algorithms are developed.
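
A minimal sketch of this guided ordering follows, assuming a precomputed `prob_map` array aligned pixel-for-pixel with the image and the same hypothetical `classifier_score` as before; the probability cutoff and other parameters are illustrative, not taken from the paper.

```python
import numpy as np

def guided_scan(image, prob_map, classifier_score, win_h=64, win_w=64,
                score_threshold=0.5, prob_cutoff=1e-3):
    """Evaluate subwindows in descending order of map probability,
    stopping early once remaining coordinates fall below prob_cutoff."""
    # Sort all coordinates from most to least probable.
    order = np.argsort(prob_map, axis=None)[::-1]
    coords = np.column_stack(np.unravel_index(order, prob_map.shape))
    detections = []
    for y, x in coords:
        if prob_map[y, x] < prob_cutoff:
            break                  # early termination: everything after is even less likely
        if y + win_h > image.shape[0] or x + win_w > image.shape[1]:
            continue               # window would extend past the image border
        window = image[y:y + win_h, x:x + win_w]
        if classifier_score(window) > score_threshold:
            detections.append((x, y))
    return detections
```

Because the coordinates are visited in descending order of probability, the early break both caps the number of evaluations and skips exactly the low-probability regions where false positives would otherwise arise.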

Object-based Probability Maps

In its simplest form, a probability map for an object o is merely the PDF of the object's spatial location. We call this function an object-based probability map, since it depends only on statistics of the object itself, without considering additional contextual cues. This distribution is modelled as a sum of two-dimensional conditional Gaussian distributions.
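
One plausible way to write this mixture, sketched here with assumed notation (N components with means \mu_i, covariance matrices \Sigma_i, and mixing weights w_i; the paper's exact parameterization and conditioning variables may differ), is:

```latex
p_o(x, y) = \sum_{i=1}^{N} w_i \, \mathcal{N}\!\left((x, y) \mid \mu_i, \Sigma_i\right),
\qquad \sum_{i=1}^{N} w_i = 1
```

where p_o denotes the object-based probability map for object o; the means, covariances, and weights would be fit from the object's observed positions in training images.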

References

[1] Stan Z. Li, et al. Learning probabilistic distribution model for multi-view face detection, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001).

[2] Antonio Torralba, et al. LabelMe: A Database and Web-Based Tool for Image Annotation, 2008, International Journal of Computer Vision.

[3] Paul A. Viola, et al. Rapid object detection using a boosted cascade of simple features, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001).

[4] Bernd Neumann, et al. Feedback between Low-level and High-level Image Processing, 2007.

[5] Thomas Serre, et al. Component-based face detection, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001).

[6] Christiane Fellbaum, et al. WordNet: An Electronic Lexical Database, 1999, Computational Linguistics.

[7] Giulio Sandini, et al. Object-based Visual Attention: a Model for a Behaving Robot, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) Workshops.

[8] Narendra Ahuja, et al. Detecting Faces in Images: A Survey, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[9] Neil Gershenfeld. The Nature of Mathematical Modeling, 1998.

[10] Antonio Torralba, et al. Contextual Modulation of Target Saliency, 2001, NIPS.

[11] Erik Hjelmås, et al. Face Detection: A Survey, 2001, Comput. Vis. Image Underst.

[12] Gary R. Bradski, et al. Peripheral-Foveal Vision for Real-time Object Recognition and Tracking in Video, 2007, IJCAI.

[13] Tomaso A. Poggio, et al. A general framework for object detection, 1998, Sixth International Conference on Computer Vision.

[14] Takeo Kanade, et al. A statistical method for 3D object detection applied to faces and cars, 2000, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000).