HyperCon: Image-To-Video Model Transfer for Video-To-Video Translation Tasks

Video-to-video translation tasks such as super-resolution, inpainting, and style transfer are more difficult than their image-to-image counterparts due to the temporal consistency problem, which, if left unaddressed, results in distracting flickering artifacts. Although video models designed from scratch produce temporally consistent results, training them to match the vast visual knowledge captured by image models requires an intractable number of videos. To combine the benefits of image and video models, we propose an image-to-video model transfer method called Hyperconsistency (HyperCon) that transforms any well-trained image model into a temporally consistent video model without fine-tuning. HyperCon works by synthetically interpolating the input video in time, translating the interpolated video frame-wise, and then aggregating over temporally localized windows of the translated frames. It handles both masked and unmasked inputs, enabling support for even more video-to-video tasks than prior image-to-video model transfer techniques. We demonstrate HyperCon on video style transfer and video inpainting, where it compares favorably to prior state-of-the-art video consistency and video inpainting methods, all without training on a single stylized or incomplete video.
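To make the interpolate-translate-aggregate pipeline above concrete, here is a minimal sketch in Python/NumPy. The `image_model` and `interpolate` callables are placeholders for any pre-trained image translation model and any frame-interpolation network; the actual method aligns window frames (e.g., with optical flow) before aggregating, whereas this sketch substitutes a simple per-pixel mean over each temporal window.

```python
# Minimal sketch of the HyperCon pipeline, under the assumptions stated above:
# `image_model` and `interpolate` are hypothetical stand-ins for a pre-trained
# image translation model and a frame-interpolation model, and per-pixel mean
# aggregation replaces the paper's flow-aligned aggregation.
import numpy as np

def hypercon(frames, image_model, interpolate, n_interp=3, window=7):
    """Apply an image-to-image model to a video with HyperCon-style smoothing.

    frames:      list of HxWxC float arrays (the input video)
    image_model: callable mapping one frame to one translated frame
    interpolate: callable mapping (frame_a, frame_b, n) to n in-between frames
    n_interp:    synthetic frames inserted between each consecutive pair
    window:      size of the temporal aggregation window (odd)
    """
    # 1. Temporal interpolation: insert synthetic in-between frames so that
    #    neighboring frames are visually closer, which makes the frame-wise
    #    translations in step 2 vary more smoothly across time.
    dense = [frames[0]]
    for a, b in zip(frames[:-1], frames[1:]):
        dense.extend(interpolate(a, b, n_interp))
        dense.append(b)

    # 2. Frame-wise translation: apply the image model to every frame
    #    independently, with no fine-tuning on video data.
    translated = [image_model(f) for f in dense]

    # 3. Aggregation: each original frame's output is pooled over a temporally
    #    localized window centered on it in the translated, interpolated video.
    stride = n_interp + 1          # original frames sit every `stride` frames
    half = window // 2
    out = []
    for i in range(len(frames)):
        c = i * stride             # index of original frame i in `dense`
        lo, hi = max(0, c - half), min(len(translated), c + half + 1)
        out.append(np.mean(translated[lo:hi], axis=0))
    return out
```

Note the design intent this sketch captures: the image model never sees more than one frame at a time, so all temporal consistency comes from densifying the video before translation and pooling the translated frames afterward.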
