Temporally Coherent Video Harmonization Using Adversarial Networks

Compositing is one of the most important editing operations for images and videos. The process of improving the realism of composite results is often called harmonization. Previous approaches to harmonization focus mainly on images. In this paper, we go one step further and address the problem of video harmonization. Specifically, we train a convolutional neural network in an adversarial manner, exploiting a pixel-wise disharmony discriminator to produce more realistic harmonized results and introducing a temporal loss to increase temporal consistency between consecutive harmonized frames. Thanks to the pixel-wise disharmony discriminator, we are also able to relax the need for input foreground masks. Since existing video datasets with ground-truth foreground masks and optical flow are not sufficiently large, we propose a simple yet efficient method to build a synthetic dataset that supports supervised training of the proposed adversarial network. Experiments show that training on our synthetic dataset generalizes well to real-world composites. In addition, our method successfully incorporates temporal consistency during training and achieves more harmonious visual results than previous methods.
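
The abstract does not give the exact loss formulation, but its two key ingredients, a temporal loss that ties consecutive harmonized frames together via optical flow and a generator-side term driven by a pixel-wise disharmony discriminator, can be illustrated as follows. This is a minimal PyTorch-style sketch under stated assumptions: the flow is assumed to go from frame t to frame t-1 with (dx, dy) channel order, occlusions are assumed to be handled by a validity mask, and the helpers warp_with_flow, temporal_consistency_loss, and pixelwise_adversarial_loss are hypothetical names, not the authors' code.

```python
# Minimal sketch (assumed, PyTorch-style); not the authors' implementation.
import torch
import torch.nn.functional as F


def warp_with_flow(prev_frame, flow_t_to_prev):
    """Backward-warp the previous frame into the current frame's coordinates.

    prev_frame:      (N, C, H, W) harmonized frame at time t-1
    flow_t_to_prev:  (N, 2, H, W) flow from frame t to frame t-1, channels (dx, dy)
    """
    n, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev_frame.device),
        torch.arange(w, device=prev_frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()   # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow_t_to_prev   # where each pixel samples in frame t-1
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (N, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)


def temporal_consistency_loss(harm_t, harm_prev, flow_t_to_prev, valid_mask):
    """Penalize per-pixel change between the current harmonized frame and the
    flow-warped previous harmonized frame; valid_mask zeros out occluded pixels."""
    warped_prev = warp_with_flow(harm_prev, flow_t_to_prev)
    return (valid_mask * (harm_t - warped_prev).abs()).mean()


def pixelwise_adversarial_loss(disharmony_map):
    """Generator-side term for a pixel-wise disharmony discriminator: push the
    predicted per-pixel disharmony probability of the harmonized frame toward 0."""
    return F.binary_cross_entropy(disharmony_map, torch.zeros_like(disharmony_map))
```

In training, these terms would typically be combined with a pixel-wise reconstruction loss using scalar weights; the specific weighting and discriminator architecture used by the paper are not stated in this abstract.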
