ZM-Net: Real-time Zero-shot Image Manipulation Network

Many problems in image processing and computer vision (e.g. colorization, style transfer) can be posed as 'manipulating' an input image into a corresponding output image given a user-specified guiding signal. A holy-grail solution to generic image manipulation should be able to efficiently alter an input image according to any personalized signal (even signals unseen during training), such as diverse paintings and arbitrary descriptive attributes. However, existing methods are either too inefficient to process multiple signals simultaneously (let alone generalize to unseen signals) or unable to handle signals from other modalities. In this paper, we make the first attempt to address the zero-shot image manipulation task. We cast this problem as manipulating an input image according to a parametric model whose key parameters can be conditionally generated from any guiding signal (even unseen ones). To this end, we propose the Zero-shot Manipulation Net (ZM-Net), a fully differentiable architecture that jointly optimizes an image-transformation network (TNet) and a parameter network (PNet). The PNet learns to generate key transformation parameters for the TNet given any guiding signal, while the TNet performs fast zero-shot image manipulation according to both the signal-dependent parameters from the PNet and the signal-invariant parameters of the TNet itself. Extensive experiments show that ZM-Net can perform high-quality image manipulation conditioned on different forms of guiding signals (e.g. style images and attributes) in real time (tens of milliseconds per image), even for unseen signals. Moreover, we construct a large-scale style dataset of over 20,000 style images to promote further research.
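The abstract does not pin down how the PNet's signal-dependent parameters enter the TNet. Below is a minimal PyTorch sketch of one plausible reading: the PNet maps a guiding-signal embedding to per-channel scale and shift parameters that modulate the TNet's normalization layers (in the spirit of conditional instance normalization), while the TNet's convolutions stay signal-invariant. The embedding size, layer widths, and all names other than PNet and TNet are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PNet(nn.Module):
    """Parameter network: maps a guiding-signal embedding to per-channel
    scale/shift parameters for the TNet. The 512-d embedding and the layer
    widths are assumptions for illustration, not the paper's values."""
    def __init__(self, signal_dim=512, channels=(64, 128, 64)):
        super().__init__()
        # One linear head per conditioned TNet layer, emitting gamma and beta.
        self.heads = nn.ModuleList(
            [nn.Linear(signal_dim, 2 * c) for c in channels]
        )

    def forward(self, signal):
        # Returns a list of (gamma, beta) pairs, one per conditioned layer.
        return [head(signal).chunk(2, dim=1) for head in self.heads]

class CondInstanceNorm(nn.Module):
    """Instance norm whose affine parameters are supplied by the PNet
    (signal-dependent) instead of being learned constants."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)

    def forward(self, x, gamma, beta):
        # Broadcast per-channel parameters over the spatial dimensions.
        return self.norm(x) * gamma[:, :, None, None] + beta[:, :, None, None]

class TNet(nn.Module):
    """Transformation network: signal-invariant convolutions plus
    signal-dependent normalization parameters from the PNet."""
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(3, 64, 3, padding=1),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.Conv2d(128, 64, 3, padding=1),
        ])
        self.norms = nn.ModuleList(
            [CondInstanceNorm(c) for c in (64, 128, 64)]
        )
        self.out = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, x, params):
        for conv, norm, (gamma, beta) in zip(self.convs, self.norms, params):
            x = torch.relu(norm(conv(x), gamma, beta))
        return torch.sigmoid(self.out(x))

# Usage: one forward pass manipulates the image; an unseen guiding signal
# at test time just yields a fresh parameter set, with no retraining.
pnet, tnet = PNet(), TNet()
signal = torch.randn(1, 512)        # embedding of a guiding signal
image = torch.rand(1, 3, 256, 256)  # input image
output = tnet(image, pnet(signal))  # zero-shot manipulation
```

Under this reading, the PNet is an ordinary feed-forward network, so conditioning on a new signal costs only one extra forward pass; that is what makes real-time, zero-shot manipulation plausible even for signals never seen during training.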
