Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes

Recent advances in Deep Learning and probabilistic modeling have led to strong improvements in generative models for images. On the one hand, Generative Adversarial Networks (GANs) have contributed a highly effective adversarial learning procedure, but still suffer from stability issues. On the other hand, Conditional Variational Auto-Encoders (CVAE) models provide a sound way of conditional modeling but suffer from mode-mixing issues. Therefore, recent work has turned back to simple and stable regression models that are effective at generation but give up on the sampling mechanism and the latent code representation. We propose a novel and efficient stochastic regression approach with latent drop-out codes that combines the merits of both lines of research. In addition, a new training objective increases coverage of the training distribution leading to improvements over the state of the art in terms of accuracy as well as diversity.

[1]  Bolei Zhou,et al.  Network Dissection: Quantifying Interpretability of Deep Visual Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yu Qiao,et al.  Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks , 2016, IEEE Signal Processing Letters.

[3]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[4]  Harry Shum,et al.  Face Hallucination: Theory and Practice , 2007, International Journal of Computer Vision.

[5]  Pushmeet Kohli,et al.  Multiple Choice Learning: Learning to Produce Multiple Structured Outputs , 2012, NIPS.

[6]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[9]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[10]  Abhinav Gupta,et al.  Generative Image Modeling Using Style and Structure Adversarial Networks , 2016, ECCV.

[11]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[12]  Abhinav Gupta,et al.  Designing deep networks for surface normal estimation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Marwan Mattar,et al.  Labeled Faces in the Wild: A Database forStudying Face Recognition in Unconstrained Environments , 2008 .

[14]  Yu Qiao,et al.  A Discriminative Feature Learning Approach for Deep Face Recognition , 2016, ECCV.

[15]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[16]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[17]  Thomas Brox,et al.  Generating Images with Perceptual Similarity Metrics based on Deep Networks , 2016, NIPS.

[18]  Alexei A. Efros,et al.  Image quilting for texture synthesis and transfer , 2001, SIGGRAPH.

[19]  Christian Ledig,et al.  Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[21]  Alexei A. Efros,et al.  Toward Multimodal Image-to-Image Translation , 2017, NIPS.

[22]  C. V. Jawahar,et al.  Cats and dogs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Yaser Sheikh,et al.  PixelNN: Example-based Image Synthesis , 2017, ICLR.

[24]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[26]  Ming-Hsuan Yang,et al.  Diversified Texture Synthesis with Feed-Forward Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Xiaoou Tang,et al.  Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[28]  David Zhang,et al.  FSIM: A Feature Similarity Index for Image Quality Assessment , 2011, IEEE Transactions on Image Processing.

[29]  Vladlen Koltun,et al.  Photographic Image Synthesis with Cascaded Refinement Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[31]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[32]  Xinlei Chen,et al.  PixelNet: Representation of the pixels, by the pixels, and for the pixels , 2017, ArXiv.

[33]  Sung Yong Shin,et al.  On pixel-based texture synthesis by non-parametric sampling , 2006, Comput. Graph..

[34]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[35]  Alexei A. Efros,et al.  Scene completion using millions of photographs , 2007, SIGGRAPH 2007.

[36]  Seong Joon Oh,et al.  Adversarial Image Perturbation for Privacy Protection A Game Theory Perspective , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).