Online Bag-of-Visual-Words Generation for Unsupervised Representation Learning

Learning image representations without human supervision is an important and active research field. Several recent approaches have successfully leveraged the idea of making such a representation invariant under different types of perturbations, especially via contrastive-based instance discrimination training. Although effective visual representations should indeed exhibit such invariances, there are other important characteristics, such as encoding contextual reasoning skills, for which alternative reconstruction-based approaches might be better suited. With this in mind, we propose a teacher-student scheme to learn representations by training a convnet to reconstruct a bag-of-visual-words (BoW) representation of an image, given as input a perturbed version of that same image. Our strategy performs an online training of both the teacher network (whose role is to generate the BoW targets) and the student network (whose role is to learn representations), along with an online update of the visual-words vocabulary (used for the BoW targets). This idea effectively enables fully online BoW-guided unsupervised learning. Extensive experiments demonstrate the interest of our BoW-based strategy which surpasses previous state-of-theart methods (including contrastive-based ones) in several applications. For instance, in downstream tasks such Pascal object detection, Pascal classification and Places205 classification, our method improves over all prior unsupervised approaches, thus establishing new state-of-the-art results that are also significantly better even than those of supervised pre-training. We provide the implementation code at https://github.com/valeoai/obow.

[1]  Jürgen Schmidhuber,et al.  Evolving Modular Fast-Weight Networks for Control , 2005, ICANN.

[2]  Andrea Vedaldi,et al.  Self-labelling via simultaneous clustering and representation learning , 2020, ICLR.

[3]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[4]  Ali Razavi,et al.  Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.

[5]  George Papamakarios,et al.  Distilling Model Knowledge , 2015, ArXiv.

[6]  Junnan Li,et al.  Prototypical Contrastive Learning of Unsupervised Representations , 2020, ArXiv.

[7]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[8]  Kaiming He,et al.  Improved Baselines with Momentum Contrastive Learning , 2020, ArXiv.

[9]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[10]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  David Saad,et al.  Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks , 1995, NIPS.

[13]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[14]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Julien Mairal,et al.  Unsupervised Pre-Training of Image Features on Non-Curated Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[18]  Oriol Vinyals,et al.  Neural Discrete Representation Learning , 2017, NIPS.

[19]  Wei Shen,et al.  Few-Shot Image Recognition by Predicting Parameters from Activations , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Trevor Darrell,et al.  Adversarial Feature Learning , 2016, ICLR.

[21]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Sergio Guadarrama,et al.  Tracking Emerges by Colorizing Videos , 2018, ECCV.

[23]  Masashi Sugiyama,et al.  Bayesian Dark Knowledge , 2015 .

[24]  Yannis Kalantidis,et al.  Hard Negative Mixing for Contrastive Learning , 2020, NeurIPS.

[25]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[26]  Tomás Pajdla,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[28]  Phillip Isola,et al.  Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere , 2020, ICML.

[29]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[30]  Ali Razavi,et al.  Generating Diverse High-Fidelity Images with VQ-VAE-2 , 2019, NeurIPS.

[31]  Ming-Hsuan Yang,et al.  Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Matthieu Cord,et al.  The Missing Data Encoder: Cross-Channel Image Completion\\with Hide-And-Seek Adversarial Network , 2019, AAAI.

[33]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[34]  Nikos Komodakis,et al.  Dynamic Few-Shot Visual Learning Without Forgetting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Yoshua Bengio,et al.  Interpolation Consistency Training for Semi-Supervised Learning , 2019, IJCAI.

[36]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[37]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Michal Valko,et al.  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[39]  Abhinav Gupta,et al.  Scaling and Benchmarking Self-Supervised Visual Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Fillia Makedon,et al.  A Survey on Contrastive Self-supervised Learning , 2020, Technologies.

[41]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[43]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[44]  Xiaohua Zhai,et al.  Self-Supervised GANs via Auxiliary Rotation Loss , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Jiebo Luo,et al.  AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations Rather Than Data , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Andrew Zisserman,et al.  The Visual Centrifuge: Model-Free Layered Video Representations , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Chen Sun,et al.  What makes for good views for contrastive learning , 2020, NeurIPS.

[48]  Kyunghyun Cho,et al.  A Framework For Contrastive Self-Supervised Learning And Designing A New Approach , 2020, ArXiv.

[49]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[50]  Matthieu Cord,et al.  QUEST: Quantized embedding space for transferring knowledge , 2019, ECCV.

[51]  David J. Schwab,et al.  Are all negatives created equal in contrastive instance discrimination? , 2020, ArXiv.

[52]  Matthijs Douze,et al.  Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[53]  Andrew Zisserman,et al.  Learning and Using the Arrow of Time , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Matthieu Cord,et al.  Learning Representations by Predicting Bags of Visual Words , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Dahua Lin,et al.  Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination , 2018, ArXiv.

[57]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[58]  Seong Joon Oh,et al.  CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[59]  Gregory Shakhnarovich,et al.  Learning Representations for Automatic Colorization , 2016, ECCV.

[60]  Jeff Donahue,et al.  Large Scale Adversarial Representation Learning , 2019, NeurIPS.

[61]  Chong-Wah Ngo,et al.  Evaluating bag-of-visual-words representations in scene classification , 2007, MIR '07.

[62]  Quoc V. Le,et al.  Selfie: Self-supervised Pretraining for Image Embedding , 2019, ArXiv.

[63]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[64]  E. Gardner,et al.  Three unfinished works on the optimal storage capacity of networks , 1989 .

[65]  Rares Ambrus,et al.  SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[66]  Yannis Avrithis,et al.  To Aggregate or Not to aggregate: Selective Match Kernels for Image Search , 2013, 2013 IEEE International Conference on Computer Vision.

[67]  Andrew Zisserman,et al.  Video Google: Efficient Visual Search of Videos , 2006, Toward Category-Level Object Recognition.

[68]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Alexei A. Efros,et al.  Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Andrew Zisserman,et al.  Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[71]  Thomas Brox,et al.  Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[72]  Julien Mairal,et al.  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments , 2020, NeurIPS.

[73]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[74]  Harri Valpola,et al.  Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[75]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[76]  Aaron C. Courville,et al.  Adversarially Learned Inference , 2016, ICLR.

[77]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[78]  Laurens van der Maaten,et al.  Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Rich Caruana,et al.  Model compression , 2006, KDD '06.

[80]  Oriol Vinyals,et al.  Matching Networks for One Shot Learning , 2016, NIPS.

[81]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[82]  Thomas Brox,et al.  Discriminative Unsupervised Feature Learning with Convolutional Neural Networks , 2014, NIPS.