Contrastive Syn-to-Real Generalization

Training on synthetic data can be beneficial in label- or data-scarce scenarios. However, models trained on synthetic data often generalize poorly to real domains because of the synthetic-to-real domain gap. In this work, we make a key observation: the diversity of the learned feature embeddings plays an important role in generalization performance. To this end, we propose contrastive synthetic-to-real generalization (CSG), a novel framework that leverages pre-trained ImageNet knowledge to prevent overfitting to the synthetic domain, while promoting the diversity of feature embeddings as an inductive bias to improve generalization. In addition, we enhance the CSG framework with attentional pooling (A-pool), which lets the model focus on semantically important regions and further improves generalization. We demonstrate the effectiveness of CSG on various synthetic training tasks, achieving state-of-the-art performance on zero-shot domain generalization.
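As a rough illustration only (the paper specifies its exact objectives elsewhere), the two ingredients named above can be sketched: an InfoNCE-style contrastive loss that pulls a synthetically trained student's embeddings toward matched embeddings from a frozen ImageNet-pretrained teacher, and an attention-weighted spatial pooling in place of plain average pooling. All names (`info_nce`, `a_pool`), shapes, and the temperature value here are assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def a_pool(feat, g):
    """Attentional pooling sketch.
    feat: (C, H*W) flattened feature map; g: (C,) global query vector.
    Spatial locations are weighted by softmax of their similarity to the query,
    so semantically salient regions dominate the pooled embedding."""
    scores = g @ feat                           # (H*W,) query-key similarities
    scores = scores - scores.max()              # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()
    return feat @ attn                          # (C,) attention-weighted embedding

def info_nce(z_student, z_teacher, tau=0.07):
    """InfoNCE-style contrastive loss sketch.
    z_*: (N, D) L2-normalized embeddings; row i of the student is the positive
    for row i of the teacher, all other rows serve as negatives."""
    logits = (z_student @ z_teacher.T) / tau    # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # cross-entropy on matched pairs
```

Minimizing `info_nce` keeps the student's features close to the diverse ImageNet representation (discouraging collapse onto synthetic-domain shortcuts), while `a_pool` replaces uniform spatial averaging with a content-dependent weighting.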
