Few-shot Semantic Image Synthesis with Class Affinity Transfer

Semantic image synthesis aims to generate photorealistic images from semantic segmentation maps. Despite much recent progress, training such models still requires large datasets of images annotated with per-pixel label maps, which are extremely tedious to obtain. To alleviate this high annotation cost, we propose a transfer method that leverages a model trained on a large source dataset to improve learning on small target datasets, via estimated pairwise relations between source and target classes. A class affinity matrix is introduced as a first layer of the source model to make it compatible with the target label maps, and the source model is then further finetuned on the target domain. To estimate the class affinities, we consider different ways of leveraging prior knowledge: semantic segmentation on the source domain, textual label embeddings, and self-supervised vision features. We apply our approach to GAN-based and diffusion-based architectures for semantic image synthesis. Our experiments show that the different ways of estimating class affinity can be effectively combined, and that our approach significantly improves over existing state-of-the-art transfer methods for generative image models.
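
The core mechanism lends itself to a short illustration. Below is a minimal PyTorch sketch of one way to realise it, using the text-embedding variant as an example: class affinities are taken as cosine similarities between (externally computed) text embeddings of the source and target class names, and the resulting matrix is applied as a 1x1 convolution prepended to the source model, projecting one-hot target label maps into the source label space before finetuning. The softmax normalisation and the 1x1-conv parameterisation here are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def affinity_from_text(source_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    """Estimate class affinities from label-name text embeddings.

    source_emb: (C_src, D) embeddings of the source class names.
    target_emb: (C_tgt, D) embeddings of the target class names.
    Returns an affinity matrix of shape (C_src, C_tgt); each column is
    softmax-normalised over source classes, so every target class is
    expressed as a soft mixture of source classes (an assumed choice).
    """
    src = F.normalize(source_emb, dim=-1)
    tgt = F.normalize(target_emb, dim=-1)
    sim = src @ tgt.t()            # pairwise cosine similarities, (C_src, C_tgt)
    return sim.softmax(dim=0)      # columns sum to 1 over source classes

class AffinityLayer(nn.Module):
    """First layer prepended to the source synthesis model: maps one-hot
    target label maps (B, C_tgt, H, W) to the source class space
    (B, C_src, H, W). Implemented as a 1x1 convolution initialised from
    the affinity matrix and finetuned jointly with the rest of the model."""

    def __init__(self, affinity: torch.Tensor):
        super().__init__()
        c_src, c_tgt = affinity.shape
        self.proj = nn.Conv2d(c_tgt, c_src, kernel_size=1, bias=False)
        with torch.no_grad():
            self.proj.weight.copy_(affinity.view(c_src, c_tgt, 1, 1))

    def forward(self, target_onehot: torch.Tensor) -> torch.Tensor:
        return self.proj(target_onehot)
```

Given an affinity matrix `A` of shape `(C_src, C_tgt)`, the adapted model is then simply `source_model(AffinityLayer(A)(y))` for a one-hot target label map `y`; the layer's weights are updated along with the source model during finetuning on the small target dataset.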
