MPG: A Multi-ingredient Pizza Image Generator with Conditional StyleGANs

Multilabel conditional image generation is a challenging problem in computer vision. In this work we propose Multi-ingredient Pizza Generator (MPG), a conditional Generative Neural Network (GAN) framework for synthesizing multilabel images. We design MPG based on a state-of-the-art GAN structure called StyleGAN2, in which we develop a new conditioning technique by enforcing intermediate feature maps to learn scalewise label information. Because of the complex nature of the multilabel image generation problem, we also regularize synthetic image by predicting the corresponding ingredients as well as encourage the discriminator to distinguish between matched image and mismatched image. To verify the efficacy of MPG, we test it on Pizza10, which is a carefully annotated multi-ingredient pizza image dataset. MPG can successfully generate photo-realist pizza images with desired ingredients. The framework can be easily extend to other multilabel image generation scenarios.

[1]  Jaakko Lehtinen,et al.  Analyzing and Improving the Image Quality of StyleGAN , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  拓海 杉山,et al.  “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[3]  Antonio Torralba,et al.  How to Make a Pizza: Learning a Compositional Layer-Based GAN Model , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  R'emi Louf,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[5]  Jaakko Lehtinen,et al.  Training Generative Adversarial Networks with Limited Data , 2020, NeurIPS.

[6]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[8]  Jung-Woo Ha,et al.  StarGAN: Unified Generative Adversarial Networks for Multi-domain Image-to-Image Translation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[10]  Gerasimos Spanakis,et al.  LoGANv2: Conditional Style-Based Logo Generation with Generative Adversarial Networks , 2019, 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA).

[11]  Harris Drucker,et al.  Improving generalization performance using double backpropagation , 1992, IEEE Trans. Neural Networks.

[12]  Prafulla Dhariwal,et al.  Glow: Generative Flow with Invertible 1x1 Convolutions , 2018, NeurIPS.

[13]  Samy Bengio,et al.  Density estimation using Real NVP , 2016, ICLR.

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[16]  Thomas Lukasiewicz,et al.  Controllable Text-to-Image Generation , 2019, NeurIPS.

[17]  Xiaogang Wang,et al.  StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Andrew Slavin Ross,et al.  Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients , 2017, AAAI.

[19]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[20]  Alexei A. Efros,et al.  Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Yuichi Yoshida,et al.  Spectral Normalization for Generative Adversarial Networks , 2018, ICLR.

[22]  Sebastian Nowozin,et al.  Which Training Methods for GANs do actually Converge? , 2018, ICML.

[23]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[24]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[25]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Honghao Gao,et al.  A Food Dish Image Generation Framework Based on Progressive Growing GANs , 2019, CollaborateCom.

[27]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[28]  Wataru Shimoda,et al.  Food image generation using a large amount of food images with conditional GAN: ramenGAN and recipeGAN , 2018, MADiMa@IJCAI.

[29]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[30]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[31]  Takeru Miyato,et al.  cGANs with Projection Discriminator , 2018, ICLR.

[32]  Aaron C. Courville,et al.  Improved Training of Wasserstein GANs , 2017, NIPS.

[33]  Vladimir Pavlovic,et al.  CookGAN: Meal Image Synthesis from Ingredients , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[34]  Yoshua Bengio,et al.  NICE: Non-linear Independent Components Estimation , 2014, ICLR.

[35]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Amaia Salvador,et al.  Learning Cross-Modal Embeddings for Cooking Recipes and Food Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Ori Bar El,et al.  GILT: Generating Images from Long Text , 2019, ArXiv.

[38]  Shiguang Shan,et al.  AttGAN: Facial Attribute Editing by Only Changing What You Want , 2017, IEEE Transactions on Image Processing.

[39]  Jing Zhang,et al.  MirrorGAN: Learning Text-To-Image Generation by Redescription , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[41]  Steven C. H. Hoi,et al.  Learning Cross-Modal Embeddings With Adversarial Networks for Cooking Recipes and Food Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).