Learning to Segment Instances in Videos with Spatial Propagation Network

We propose a deep learning-based framework for instance-level object segmentation. Our method mainly consists of three steps. First, We train a generic model based on ResNet-101 for foreground/background segmentations. Second, based on this generic model, we fine-tune it to learn instance-level models and segment individual objects by using augmented object annotations in first frames of test videos. To distinguish different instances in the same video, we compute a pixel-level score map for each object from these instance-level models. Each score map indicates the objectness likelihood and is only computed within the foreground mask obtained in the first step. To further refine this per frame score map, we learn a spatial propagation network. This network aims to learn how to propagate a coarse segmentation mask spatially based on the pairwise similarities in each frame. In addition, we apply a filter on the refined score map that aims to recognize the best connected region using spatial and temporal consistencies in the video. Finally, we decide the instance-level object segmentation in each video by comparing score maps of different instances.

[1]  Thomas Brox,et al.  Video Segmentation with Just a Few Strokes , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[3]  Bernt Schiele,et al.  Video Segmentation with Superpixels , 2012, ACCV.

[4]  Yong Jae Lee,et al.  Key-segments for video object segmentation , 2011, 2011 International Conference on Computer Vision.

[5]  Thomas Brox,et al.  A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  Alexander Sorkine-Hornung,et al.  Bilateral Space Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[10]  Ming-Hsuan Yang,et al.  Multi-instance object segmentation with occlusion handling , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Mei Han,et al.  Efficient hierarchical graph-based video segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Michal Irani,et al.  Video Segmentation by Non-Local Consensus voting , 2014, BMVC.

[13]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ming-Hsuan Yang,et al.  Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network , 2016, ECCV.

[15]  Jürgen Schmidhuber,et al.  Recurrent Highway Networks , 2016, ICML.

[16]  Markus H. Gross,et al.  Fully Connected Object Proposals for Video Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[18]  Bernt Schiele,et al.  Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[21]  Mubarak Shah,et al.  Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Esa Rahtu,et al.  Segmenting Salient Objects from Images and Videos , 2010, ECCV.

[23]  Luc Van Gool,et al.  The 2017 DAVIS Challenge on Video Object Segmentation , 2017, ArXiv.

[24]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[25]  Michael J. Black,et al.  Video Segmentation via Object Flow , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Fatih Murat Porikli,et al.  Saliency-aware geodesic video object segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Sanja Fidler,et al.  Instance-Level Segmentation with Deep Densely Connected MRFs , 2015, ArXiv.

[29]  Chenliang Xu,et al.  Streaming Hierarchical Video Segmentation , 2012, ECCV.

[30]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Kristen Grauman,et al.  Supervoxel-Consistent Foreground Propagation in Video , 2014, ECCV.

[32]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  James M. Rehg,et al.  Video Segmentation by Tracking Many Figure-Ground Segments , 2013, 2013 IEEE International Conference on Computer Vision.