论文信息 - Learning What to Learn for Video Object Segmentation

Learning What to Learn for Video Object Segmentation

Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module. This internal learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond standard few-shot learning techniques by learning what the few-shot learner should learn. This allows us to achieve a rich internal representation of the target in the current frame, significantly increasing the segmentation accuracy of our approach. We perform extensive experiments on multiple benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result.

[1] Philip H.S. Torr,et al. Siam R-CNN: Visual Tracking by Re-Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Aggelos K. Katsaggelos,et al. Efficient Video Object Segmentation via Network Modulation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3] Gang Wang,et al. Motion-Guided Cascaded Refinement Network for Video Object Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[5] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[6] Xiaojuan Qi,et al. AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[7] Sergey Levine,et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[8] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[9] Bernt Schiele,et al. Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Philip H. S. Torr,et al. Meta-Learning Deep Visual Words for Fast Video Object Segmentation , 2018, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[11] Bernhard Rinner,et al. Adaptive cartooning for privacy protection in camera networks , 2014, 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[12] Ning Xu,et al. YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark , 2018, ArXiv.

[13] Alexander C. Berg,et al. Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers , 2018, ECCV.

[14] Saeid Nahavandi,et al. Kangaroo Vehicle Collision Detection Using Deep Semantic Segmentation Convolutional Neural Network , 2016, 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[15] Kalyan Sunkavalli,et al. Fast Video Object Segmentation by Reference-Guided Mask Propagation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16] Bastian Leibe,et al. Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[17] Luc Van Gool,et al. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Subhransu Maji,et al. Meta-Learning With Differentiable Convex Optimization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Bastian Leibe,et al. PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation , 2018, ACCV.

[20] Ning Xu,et al. Video Object Segmentation Using Space-Time Memory Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21] Sebastian Ramos,et al. Vision-Based Offline-Online Perception Paradigm for Autonomous Driving , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[22] Ling Shao,et al. RANet: Ranking Attention Network for Fast Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23] Luca Bertinetto,et al. Meta-learning with differentiable closed-form solvers , 2018, ICLR.

[24] Michael Felsberg,et al. Learning Fast and Robust Target Models for Video Object Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Sergio Guadarrama,et al. Tracking Emerges by Colorizing Videos , 2018, ECCV.

[26] Qiang Wang,et al. Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Luc Van Gool,et al. One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Gang Yu,et al. Learning a Discriminative Feature Network for Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29] Michael Felsberg,et al. A Generative Appearance Model for End-To-End Video Object Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Ian D. Reid,et al. Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation , 2019, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[31] Bastian Leibe,et al. FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] K.-K. Maninis,et al. Video Object Segmentation without Temporal Information , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33] Ning Xu,et al. YouTube-VOS: Sequence-to-Sequence Video Object Segmentation , 2018, ECCV.

[34] Luc Van Gool,et al. Probabilistic Regression for Visual Tracking , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Alexander G. Schwing,et al. VideoMatch: Matching based Video Object Segmentation , 2018, ECCV.

[36] Jian Sun,et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[38] Matthew B. Blaschko,et al. The Lovasz-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure in Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] Gérard G. Medioni,et al. Detecting and tracking moving objects for video surveillance , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[40] Junseok Kwon,et al. Deep Meta Learning for Real-Time Target-Aware Visual Tracking , 2017, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41] Luc Van Gool,et al. Learning Discriminative Model Prediction for Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).