Mocycle-GAN: Unpaired Video-to-Video Translation

Unsupervised image-to-image translation is the task of translating an image from one domain to another without any paired training examples, which makes it broadly applicable in practice. Nevertheless, extending such synthesis from image-to-image to video-to-video is non-trivial, especially when it comes to capturing the spatio-temporal structures in videos. The difficulty lies in the fact that not only the visual appearance of each frame but also the motion between consecutive frames must be realistic and remain consistent across the transformation. This motivates us to exploit both appearance structure and temporal continuity in video synthesis. In this paper, we present a new Motion-guided Cycle GAN, dubbed Mocycle-GAN, that integrates motion estimation into an unpaired video translator. Technically, Mocycle-GAN capitalizes on three types of constraints: an adversarial constraint discriminating between synthetic and real frames, cycle consistency encouraging an inverse translation of both frames and motion, and a motion translation constraint validating the transfer of motion between consecutive frames. Extensive experiments are conducted on video-to-labels and labels-to-video translation, and superior results are reported in comparison with state-of-the-art methods. More remarkably, we qualitatively demonstrate our Mocycle-GAN on flower-to-flower and ambient condition transfer.
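
For intuition, the sketch below writes one plausible way the three constraints could combine into a single training objective. The specific loss forms, the warping operator \mathcal{W}, the motion symbols m, \hat{m}, \tilde{m}, and the weights \lambda are illustrative assumptions for exposition, not the exact formulation of the paper.

% Hypothetical sketch of a combined objective. G: X -> Y and F: Y -> X are the
% two translators, D_Y is the frame discriminator on domain Y, x_t is the frame
% at time t, m_{t->t+1} the motion estimated between consecutive source frames,
% \hat{m} the motion recovered after the cycle F(G(.)), and \tilde{m} the motion
% translated to the target domain. Loss forms and weights are assumptions.
\begin{align}
  % adversarial constraint: synthetic frames G(x_t) should look like real frames y
  \mathcal{L}_{adv} &= \mathbb{E}_{y}\big[\log D_Y(y)\big]
                     + \mathbb{E}_{x_t}\big[\log\big(1 - D_Y(G(x_t))\big)\big],\\
  % cycle consistency on both frames and motion
  \mathcal{L}_{cyc} &= \mathbb{E}_{x_t}\big\|F(G(x_t)) - x_t\big\|_1
                     + \mathbb{E}_{x_t}\big\|\hat{m}_{t\to t+1} - m_{t\to t+1}\big\|_1,\\
  % motion translation: warping the translated frame with the translated motion
  % should agree with the translation of the next frame
  \mathcal{L}_{mot} &= \mathbb{E}_{x_t}\big\|\mathcal{W}\big(G(x_t),\, \tilde{m}_{t\to t+1}\big) - G(x_{t+1})\big\|_1,\\
  % overall objective (symmetric terms for the reverse direction Y -> X omitted)
  \mathcal{L} &= \mathcal{L}_{adv} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{mot}\,\mathcal{L}_{mot}.
\end{align}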
