Cascaded ConvLSTMs Using Semantically-Coherent Data Synthesis for Video Object Segmentation

This paper proposes a simple yet effective and efficient method for video object segmentation. Most existing methods take the color image and the optical flow as input for discovering the salient object in terms of appearance and motion. We instead leverage a ResNet backbone as an appearance-characterization encoder for each frame at different scales, and a series of Convolutional Long Short-Term Memory units (ConvLSTMs) as a motion-modeling decoder at each corresponding scale. By imposing supervision over each scale, such modules can well tackle all scales of a moving object with an inevitable scale variance over time. Instead of following a Condition Random Fields based post-processing, we use a more effective and efficient cascade module to refine the model predictions. Most existing video object segmentation datasets have limited sizes because it is expensive and time-consuming to obtain pixel-wise annotations. To overcome the data-insufficiency issue when training the deep network, we propose a semantically-coherent data synthesis strategy to augment training sequences without any efforts. Extensive experiments and ablation studies on the DAVIS 2016 dataset validate our proposed method. Furthermore, our method without the cascade module achieves a real-time speed of 26 fps on a single GPU.

[1]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Yong Jae Lee,et al.  Track and Segment: An Iterative Unsupervised Approach for Video Object Proposals , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Aggelos K. Katsaggelos,et al.  Efficient Video Object Segmentation via Network Modulation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[5]  Michal Irani,et al.  Video Segmentation by Non-Local Consensus voting , 2014, BMVC.

[6]  Martin Jägersand,et al.  Video Object Segmentation using Teacher-Student Adaptation in a Human Robot Interaction (HRI) Setting , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[7]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jun-Sik Kim,et al.  Pixel-Level Matching for Video Object Segmentation Using Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[10]  R. Venkatesh Babu,et al.  SeamSeg: Video Object Segmentation Using Patch Seams , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Michael J. Black,et al.  Video Segmentation via Object Flow , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Ming-Hsuan Yang,et al.  SegFlow: Joint Learning for Video Object Segmentation and Optical Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[14]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  C.-C. Jay Kuo,et al.  Unsupervised Video Object Segmentation with Motion-Based Bilateral Networks , 2018, ECCV.

[16]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Alexander G. Schwing,et al.  VideoMatch: Matching based Video Object Segmentation , 2018, ECCV.

[18]  Yong Jae Lee,et al.  Key-segments for video object segmentation , 2011, 2011 International Conference on Computer Vision.

[19]  Karteek Alahari,et al.  Learning Motion Patterns in Videos , 2016, CVPR.

[20]  Sanyuan Zhao,et al.  Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection , 2018, ECCV.

[21]  Karteek Alahari,et al.  Learning Video Object Segmentation with Visual Memory , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[23]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[25]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[26]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[27]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Jitendra Malik,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence Segmentation of Moving Objects by Long Term Video Analysis , 2022 .

[29]  Luc Van Gool,et al.  Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Dani Lischinski,et al.  JumpCut , 2015, ACM Trans. Graph..

[32]  Miriam Bellver,et al.  Recurrent Neural Networks for Semantic Instance Segmentation , 2017, ArXiv.

[33]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Chang-Su Kim,et al.  Primary Object Segmentation in Videos Based on Region Augmentation and Reduction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Markus H. Gross,et al.  Fully Connected Object Proposals for Video Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Cordelia Schmid,et al.  Modeling Visual Context is Key to Augmenting Object Detection Datasets , 2018, ECCV.

[38]  Ming-Hsuan Yang,et al.  Fast and Accurate Online Video Object Segmentation via Tracking Parts , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Peter V. Gehler,et al.  Video Propagation Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Zhenbao Liu,et al.  Vehicle Detection in Aerial Images Using Rotation-Invariant Cascaded Forest , 2019, IEEE Access.

[41]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Thomas Brox,et al.  Lucid Data Dreaming for Video Object Segmentation , 2017, International Journal of Computer Vision.

[43]  Alexander Sorkine-Hornung,et al.  Bilateral Space Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Qin Huang,et al.  Instance Embedding Transfer to Unsupervised Video Object Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Bernt Schiele,et al.  Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Ning Xu,et al.  YouTube-VOS: Sequence-to-Sequence Video Object Segmentation , 2018, ECCV.