Hard Negative Mixing for Contrastive Learning

Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies at either the image or the feature level improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e., the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch size or keep very large memory banks; increasing the memory size, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level that can be computed on the fly with minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection and instance segmentation, and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.
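The idea of mixing hard negatives at the feature level can be illustrated with a minimal NumPy sketch. It is not the implementation used in the paper. It assumes L2-normalized embeddings, ranks negatives by dot-product similarity to the query, and forms synthetic negatives as random convex combinations of the hardest ones. The function names, the parameters `num_hard` and `num_synthetic`, the InfoNCE-style loss and the temperature value are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def mix_hard_negatives(query, negatives, num_hard=4, num_synthetic=3, rng=None):
    """Synthesize extra negatives by mixing the hardest existing ones.

    query:     (d,) unit-norm embedding of the anchor.
    negatives: (n, d) unit-norm embeddings of other images.
    Returns a (num_synthetic, d) array of unit-norm synthetic negatives.
    """
    rng = np.random.default_rng() if rng is None else rng
    # "Hard" negatives are assumed to be those most similar to the query.
    sims = negatives @ query
    hard = negatives[np.argsort(-sims)[:num_hard]]
    # Pick random pairs of hard negatives and random mixing coefficients.
    i = rng.integers(0, len(hard), size=num_synthetic)
    j = rng.integers(0, len(hard), size=num_synthetic)
    alpha = rng.uniform(0.0, 1.0, size=(num_synthetic, 1))
    mixed = alpha * hard[i] + (1.0 - alpha) * hard[j]
    # Re-normalize so the synthetic points lie on the unit hypersphere.
    return l2_normalize(mixed)


def info_nce_loss(query, positive, negatives, temperature=0.2):
    """InfoNCE-style loss for one query (assumed form, not from the source)."""
    logits = np.concatenate(([query @ positive], negatives @ query)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))


# Usage with random stand-in features (no real encoder involved).
rng = np.random.default_rng(0)
q = l2_normalize(rng.normal(size=8))
pos = l2_normalize(q + 0.1 * rng.normal(size=8))
negs = l2_normalize(rng.normal(size=(16, 8)))
synthetic = mix_hard_negatives(q, negs, rng=rng)
loss = info_nce_loss(q, pos, np.concatenate([negs, synthetic]))
```

Because the mixing operates on embeddings that are already computed, the extra cost in this sketch is a sort over similarities and a few vector additions, which is in line with the claim of minimal overhead.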
