Self-Supervised Ranking for Representation Learning

We present a new framework for self-supervised representation learning by formulating it as a ranking problem in an image retrieval context over a large number of random views (augmentations) obtained from images. Our work rests on two intuitions: first, a good image representation should yield a high-quality ranking in a retrieval task; second, random views of an image should be ranked closer to a reference view of that image than random views of other images. Hence, we model representation learning as a learning-to-rank problem for image retrieval. We train a representation encoder by maximizing average precision (AP) for ranking, where random views of an image are treated as positives and views of other images as negatives. The new framework, dubbed S2R2, enables computing a global objective over multiple views, in contrast to the local objective of the popular contrastive learning framework, which is computed on pairs of views. In principle, by using a ranking criterion, we eliminate reliance on object-centric curated datasets. When trained on STL10 and MS-COCO, S2R2 outperforms SimCLR and the clustering-based contrastive learning model SwAV, while being considerably simpler both conceptually and in implementation. On MS-COCO, S2R2 outperforms both SwAV and SimCLR by a larger margin than on STL10. This indicates that S2R2 is more effective on diverse scenes and could eliminate the need for a large object-centric training dataset for self-supervised representation learning.
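The core mechanism described above, a differentiable AP objective over a batch of views in which views of the same image are positives, can be sketched as follows. This is a minimal illustration and not the authors' released code: the function name s2r2_ap_loss, the image_ids convention, the temperature value, and the sigmoid-based rank smoothing (in the spirit of Smooth-AP) are assumptions made here for clarity.

    # Hedged sketch, not the authors' implementation: a smoothed AP ranking loss
    # over a batch of view embeddings, where views sharing a source image are positives.
    import torch
    import torch.nn.functional as F

    def s2r2_ap_loss(embeddings: torch.Tensor,
                     image_ids: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
        """Smoothed (1 - mean AP) over a batch of view embeddings.

        embeddings: (N, D) encoder outputs for N random views.
        image_ids:  (N,) index of the source image of each view; views sharing an
                    index are positives for one another, all other views are negatives.
        """
        emb = F.normalize(embeddings, dim=1)
        sim = emb @ emb.t()                         # (N, N) cosine similarities; each view acts as a query
        n = sim.size(0)
        eye = torch.eye(n, dtype=torch.bool, device=sim.device)

        pos = (image_ids.unsqueeze(0) == image_ids.unsqueeze(1)) & ~eye   # positives per query
        cand = ~eye                                 # every other view is a retrieval candidate

        # Differentiable pairwise comparison: g[q, i, j] ~ 1 when candidate j scores
        # higher than candidate i for query q (sigmoid relaxation of a step function).
        diff = sim.unsqueeze(1) - sim.unsqueeze(2)  # diff[q, i, j] = sim[q, j] - sim[q, i]
        g = torch.sigmoid(diff / temperature)

        not_self = (~eye).unsqueeze(0)              # drop the j == i term of each rank sum
        rank_all = 1 + (g * (cand.unsqueeze(1) & not_self).float()).sum(dim=2)
        rank_pos = 1 + (g * (pos.unsqueeze(1) & not_self).float()).sum(dim=2)

        # AP per query: average over its positives of (rank among positives / rank among all).
        # Queries without any positive view contribute zero AP in this sketch.
        n_pos = pos.float().sum(dim=1).clamp(min=1)
        ap = ((rank_pos / rank_all) * pos.float()).sum(dim=1) / n_pos
        return 1.0 - ap.mean()

    # Illustrative usage with two views per image: z = encoder(views) of shape (2B, D),
    # where views 2k and 2k+1 are augmentations of image k.
    # ids = torch.arange(B).repeat_interleave(2)
    # loss = s2r2_ap_loss(z, ids)

Minimizing 1 - AP in this way scores every view against all other views in the batch at once, which is what makes the objective global over multiple views rather than local to pairs, as contrasted with the pairwise losses used in contrastive frameworks such as SimCLR.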
