Unsupervised Visual Representation Learning by Context Prediction

This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework [19] and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.

[1]  Peter Földiák,et al.  Learning Invariance from Transformation Sequences , 1991, Neural Comput..

[2]  Geoffrey E. Hinton,et al.  The "wake-sleep" algorithm for unsupervised neural networks. , 1995, Science.

[3]  David J. Field,et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images , 1996, Nature.

[4]  Pietro Perona,et al.  Unsupervised Learning of Models for Recognition , 2000, ECCV.

[5]  Edward H. Adelson,et al.  On seeing stuff: the perception of materials by humans and machines , 2001, IS&T/SPIE Electronic Imaging.

[6]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[7]  Tong Zhang,et al.  A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , 2005, J. Mach. Learn. Res..

[8]  Antonio Torralba,et al.  Learning hierarchical models of scenes, objects, and parts , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[9]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[11]  Trevor Darrell,et al.  Unsupervised Learning of Categories from Sets of Partially Matching Image Features , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[12]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[13]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[14]  Jun'ichi Tsujii,et al.  A discriminative language model with pseudo-negative samples , 2007, ACL.

[15]  Michael Isard,et al.  Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[16]  Christos Faloutsos,et al.  Unsupervised modeling of object categories using link analysis techniques , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Yiannis Aloimonos,et al.  Who killed the directed model? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[19]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[20]  Luc Van Gool,et al.  World-scale mining of objects and events from community photo collections , 2008, CIVR '08.

[21]  Stephen Parkinson,et al.  A Treatise on Optics , 2008 .

[22]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[23]  Yong Jae Lee,et al.  Foreground Focus: Unsupervised Learning from Partially Matching Images , 2009, International Journal of Computer Vision.

[24]  O. Chum,et al.  Geometric min-Hashing: Finding a (thick) needle in a haystack , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Alexei A. Efros,et al.  Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships , 2009, NIPS.

[26]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[27]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[28]  Jia Deng,et al.  A large-scale hierarchical image database , 2009, CVPR 2009.

[29]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Sinisa Todorovic,et al.  From a Set of Shapes to Object Discovery , 2010, ECCV.

[31]  Leonidas J. Guibas,et al.  Image webs: Computing and exploiting connectivity in image collections , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[33]  Hugo Larochelle,et al.  The Neural Autoregressive Distribution Estimator , 2011, AISTATS.

[34]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[35]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[36]  Alexei A. Efros,et al.  What makes Paris look like Paris? , 2015, Commun. ACM.

[37]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[38]  Zhuowen Tu,et al.  Harvesting Mid-level Visual Concepts from Large-Scale Internet Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Cordelia Schmid,et al.  Segmentation Driven Object Detection with Fisher Vectors , 2013, 2013 IEEE International Conference on Computer Vision.

[40]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[42]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[43]  Derek Hoiem,et al.  Learning Collections of Part Models for Object Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.

[45]  Jean Ponce,et al.  Learning Discriminative Part Detectors for Image Classification and Cosegmentation , 2013, 2013 IEEE International Conference on Computer Vision.

[46]  Martial Hebert,et al.  Data-Driven 3D Primitives for Single Image Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[47]  Ming Yang,et al.  Regionlets for Generic Object Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[48]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[50]  Alexei A. Efros,et al.  Context as Supervisory Signal: Discovering Objects with Predictable Context , 2014, ECCV.

[51]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[52]  Yoshua Bengio,et al.  Deep Generative Stochastic Networks Trainable by Backprop , 2013, ICML.

[53]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[54]  Jitendra Malik,et al.  Analyzing the Performance of Multilayer Neural Networks for Object Recognition , 2014, ECCV.

[55]  Björn Ommer,et al.  Randomized Max-Margin Compositions for Visual Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Michal Irani,et al.  “Clustering by Composition”—Unsupervised Discovery of Image Categories , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[58]  Frank Dellaert,et al.  Dataset fingerprints: Exploring image collections through data mining , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[60]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[61]  David A. Shamma,et al.  The New Data and New Challenges in Multimedia Research , 2015, ArXiv.

[62]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[63]  Matthias Bethge,et al.  Generative Image Modeling Using Spatial LSTMs , 2015, NIPS.

[64]  Trevor Darrell,et al.  Data-dependent Initializations of Convolutional Neural Networks , 2015, ICLR.

[65]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.