Adaptive Compact Attention for Few-Shot Video-to-Video Translation

This paper proposes an adaptive compact attention model for few-shot video-to-video translation. Existing works in this domain rely only on pixel-wise attention features without considering the correlations among multiple reference images, which leads to heavy computation yet limited performance. We therefore introduce a novel adaptive compact attention mechanism that efficiently extracts contextual features jointly from multiple reference images; the view-dependent and motion-dependent information it encodes significantly benefits the synthesis of realistic videos. Our core idea is to extract compact basis sets from all the reference images as higher-level representations. To further improve reliability, we also propose a novel method for the inference phase, based on Delaunay triangulation, that automatically selects the most informative references according to the input label. We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset; the experimental results show that our method produces photorealistic and temporally consistent videos, with considerable improvements over the state-of-the-art method.
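Although the abstract does not give the exact formulation, the core idea of attending to a compact basis rather than to every reference pixel can be sketched. Below is a minimal PyTorch sketch that uses a single EM-style soft-assignment step to distill the basis; the function name, tensor shapes, and the one-iteration update are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (assuming PyTorch) of attention over a compact basis
# distilled from multiple reference images. The EM-style soft-assignment
# update and all names/shapes below are illustrative assumptions.
import torch
import torch.nn.functional as F


def compact_attention(query_feat, ref_feats, num_bases=64):
    """query_feat: (B, C, H, W) features of the frame being synthesized.
    ref_feats:  (B, K, C, H, W) features of K reference images.
    num_bases:  size M of the compact basis set, with M << K*H*W.
    """
    B, K, C, H, W = ref_feats.shape
    # Pool every reference pixel into one candidate set: (B, C, K*H*W).
    pool = ref_feats.permute(0, 2, 1, 3, 4).reshape(B, C, K * H * W)

    # Initialize M bases by sampling pool columns, then refine them with
    # one soft-assignment (EM-style) step so each basis summarizes a
    # cluster of reference pixels across all K references jointly.
    idx = torch.randperm(K * H * W, device=pool.device)[:num_bases]
    bases = pool[:, :, idx]                                   # (B, C, M)
    assign = F.softmax(torch.einsum('bcm,bcn->bmn', bases, pool), dim=1)
    bases = F.normalize(torch.einsum('bcn,bmn->bcm', pool, assign), dim=1)

    # Attend query pixels to the M bases only: cost O(H*W*M) instead of
    # the O(H*W * K*H*W) of pixel-wise attention over all references.
    q = query_feat.reshape(B, C, H * W)
    attn = F.softmax(torch.einsum('bcm,bcn->bmn', bases, q), dim=1)
    return torch.einsum('bcm,bmn->bcn', bases, attn).reshape(B, C, H, W)
```

Because the query attends to M bases instead of all K*H*W reference pixels, the compact formulation is what makes joint attention over many references tractable.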

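The inference-time reference selection lends itself to a similarly small sketch. Below, each candidate reference's label (e.g. a pose or landmark vector) is assumed to be embedded into 2D before triangulation; that embedding, the function names, and the nearest-neighbor fallback are assumptions for illustration, with only the use of Delaunay triangulation taken from the abstract.

```python
# A minimal sketch (assuming SciPy) of Delaunay-based reference selection.
# The 2D label embedding, names, and nearest-neighbor fallback are
# assumptions; only the use of Delaunay triangulation comes from the paper.
import numpy as np
from scipy.spatial import Delaunay


def select_references(ref_labels_2d, query_label_2d):
    """ref_labels_2d: (K, 2) embedded labels (e.g. pose codes) of the
    K candidate references; requires K >= 3 non-collinear points.
    query_label_2d: (2,) embedded label of the frame to synthesize.
    Returns the indices of 3 references surrounding the query label.
    """
    tri = Delaunay(ref_labels_2d)
    simplex = tri.find_simplex(query_label_2d[None, :])[0]
    if simplex == -1:
        # Query lies outside the convex hull of the references:
        # fall back to the 3 nearest labels.
        dists = np.linalg.norm(ref_labels_2d - query_label_2d, axis=1)
        return np.argsort(dists)[:3]
    return tri.simplices[simplex]  # vertex indices of enclosing triangle
```

Picking the vertices of the enclosing triangle yields references that surround the query label in the embedding space, so the generator sees views that bracket the target pose rather than merely the few closest ones.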