Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video

We propose a method to decompose a video into a background and a set of foreground layers, where the background captures stationary elements while the foreground layers capture moving objects along with their associated effects (e.g., shadows and reflections). Our approach is designed for unconstrained monocular videos with arbitrary camera and object motion. Prior work that tackles this problem assumes that the video can be mapped onto a fixed 2D canvas, severely limiting the possible space of camera motion. Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full RGBD video layer for the background, along with a video layer for each foreground object. To solve the underconstrained decomposition problem, we propose a new loss formulation based on multi-view consistency. We test our method on challenging videos with complex camera motion and show significant qualitative improvement over current approaches.
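To make the layered representation concrete, the sketch below shows how a background and a set of foreground layers could be recombined into a frame using the standard "over" compositing operator. The abstract does not spell out the compositing or loss formulation, so everything here is an assumption for illustration: the function names `composite_over` and `reconstruction_error`, the flat per-pixel list layout, and the use of non-premultiplied colors are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, ...]                # RGB values in [0, 1]
Layer = Tuple[List[Pixel], List[float]]  # (colors, alphas), one entry per pixel


def composite_over(background: Sequence[Pixel], layers: Sequence[Layer]) -> List[Pixel]:
    """Blend foreground layers over an opaque background, back to front.

    Illustrative only: assumes non-premultiplied colors and that `layers`
    is ordered from farthest to nearest.
    """
    out = [tuple(p) for p in background]
    for colors, alphas in layers:
        if len(colors) != len(out) or len(alphas) != len(out):
            raise ValueError("every layer must have one color and one alpha per pixel")
        # "Over" operator: result = alpha * foreground + (1 - alpha) * behind
        out = [
            tuple(a * c + (1.0 - a) * o for c, o in zip(color, prev))
            for color, a, prev in zip(colors, alphas, out)
        ]
    return out


def reconstruction_error(frame: Sequence[Pixel], composite: Sequence[Pixel]) -> float:
    """Mean absolute difference between an observed frame and a composite."""
    total = sum(
        abs(f - c)
        for pf, pc in zip(frame, composite)
        for f, c in zip(pf, pc)
    )
    return total / (3 * len(frame))


# Two-pixel example: a black and a white background pixel, with one red layer
# that is fully opaque on the first pixel and half transparent on the second.
background = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
red_layer = ([(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 0.5])
frame = composite_over(background, [red_layer])
```

A semi-transparent alpha is what would let a layer carry soft effects such as shadows or reflections rather than only a hard object mask. A decomposition is only useful if compositing its layers reproduces the input frame, which is what a reconstruction term like the one above would measure; it does not by itself resolve the ambiguity that the multi-view consistency loss is meant to address.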
