Spatio-Temporal Multi-Flow Network for Video Frame Interpolation

Video frame interpolation (VFI) is currently a very active research topic, with applications spanning computer vision, post-production and video encoding. VFI can be extremely challenging, particularly in sequences containing large motions, occlusions or dynamic textures, where existing approaches fail to offer perceptually robust interpolation performance. In this context, we present a novel deep learning-based VFI method, ST-MFNet, built on a Spatio-Temporal Multi-Flow architecture. ST-MFNet employs a new multi-scale multi-flow predictor to estimate many-to-one intermediate flows, which are combined with conventional one-to-one optical flows to capture both large and complex motions. To enhance interpolation performance for various textures, a 3D CNN is also employed to model the content dynamics over an extended temporal window. Moreover, ST-MFNet has been trained within an ST-GAN framework, which was originally developed for texture synthesis, with the aim of further improving perceptual interpolation quality. Our approach has been comprehensively evaluated against fourteen state-of-the-art VFI algorithms, demonstrating that ST-MFNet consistently outperforms these benchmarks on varied and representative test datasets, with gains of up to 1.09 dB in PSNR for cases including large motions and dynamic textures. Project page: https://danielism97.github.io/ST-MFNet.
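
For readers less familiar with the metric behind the reported gains, the following is a minimal sketch of how PSNR is conventionally computed between a ground-truth frame and an interpolated frame, and of what a gain in dB implies for mean squared error. It is not taken from the ST-MFNet code base: the function names `psnr` and `mse_ratio_for_gain` and the default 8-bit peak value of 255 are assumptions made for illustration.

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape.

    `peak` is the maximum possible pixel value (255 assumes 8-bit frames).
    """
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)
```

Because PSNR is logarithmic in the mean squared error, a gain of g dB corresponds to multiplying the MSE by 10^(-g/10); `mse_ratio_for_gain(1.09)` evaluates this factor for the largest gain quoted above.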
