Object localization in ImageNet by looking out of the window

We propose a method for annotating the location of objects in ImageNet. Traditionally, this is cast as an image window classification problem, where each window is considered independently and scored based on its appearance alone. Instead, we propose a method which scores each candidate window in the context of all other windows in the image, taking into account their similarity in appearance space as well as their spatial relations in the image plane. We devise a fast and exact procedure to optimize our scoring function over all candidate windows in an image, and we learn its parameters using structured output regression. We demonstrate on 92000 images from ImageNet that this significantly improves localization over recent techniques that score windows in isolation.

[1]  Andrew Zisserman,et al.  Structured output regression for detection with partial truncation , 2009, NIPS.

[2]  Qiang Chen,et al.  Contextualizing Object Detection and Classification , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Alexei A. Efros,et al.  Ensemble of exemplar-SVMs for object detection and beyond , 2011, 2011 International Conference on Computer Vision.

[4]  Antonio Torralba,et al.  Exploiting hierarchical context on a large database of object categories , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Christoph H. Lampert,et al.  Learning to Localize Objects with Structured Output Regression , 2008, ECCV.

[6]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[7]  Andrea Vedaldi,et al.  Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[8]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[9]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[10]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[11]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Alexei A. Efros,et al.  An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Daphne Koller,et al.  Learning Spatial Context: Using Stuff to Find Things , 2008, ECCV.

[15]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[16]  Luc Van Gool,et al.  Branch&Rank: Non-Linear Object Detection , 2011, BMVC.

[17]  Andrew Zisserman,et al.  Multiple kernels for object detection , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[18]  Matthieu Guillaumin,et al.  Large-scale knowledge transfer for object localization in ImageNet , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Santiago Manen,et al.  Prime Object Proposals with Randomized Prim's Algorithm , 2013, 2013 IEEE International Conference on Computer Vision.

[20]  Sanja Fidler,et al.  The Role of Context for Object Detection and Semantic Segmentation in the Wild , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Vittorio Ferrari,et al.  Associative Embeddings for Large-Scale Knowledge Transfer with Self-Assessment , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Thomas Deselaers,et al.  Measuring the Objectness of Image Windows , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[25]  Antonio Torralba,et al.  Object Recognition by Scene Alignment , 2007, NIPS.

[26]  Christopher K. I. Williams,et al.  Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) , 2005 .

[27]  Christoph H. Lampert,et al.  Beyond sliding windows: Object localization by efficient subwindow search , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[29]  Cordelia Schmid,et al.  Combining efficient object localization and image classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[30]  Antonio Torralba,et al.  Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes , 2003, NIPS.

[31]  Luc Van Gool,et al.  Coupled Detection and Trajectory Estimation for Multi-Object Tracking , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[32]  Ming Yang,et al.  Regionlets for Generic Object Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[34]  Gökhan BakIr,et al.  Predicting Structured Data , 2008 .

[35]  Charless C. Fowlkes,et al.  Discriminative Models for Multi-Class Object Layout , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[36]  Matthieu Guillaumin,et al.  ImageNet Auto-Annotation with Segmentation Propagation , 2014, International Journal of Computer Vision.

[37]  Tsuhan Chen,et al.  Extracting adaptive contextual cues from unlabeled regions , 2011, 2011 International Conference on Computer Vision.