SPFTN: A Joint Learning Framework for Localizing and Segmenting Objects in Weakly Labeled Videos

Object localization and segmentation in weakly labeled videos are two interesting yet challenging tasks. Models for simultaneous object localization and segmentation have been explored in the conventional fully supervised setting to boost the performance of each task. However, no existing work has attempted to jointly learn object localization and segmentation models under weak supervision. To this end, we propose a joint learning framework called Self-Paced Fine-Tuning Network (SPFTN) for localizing and segmenting objects in weakly labeled videos. Jointly learning a deep model for object localization and segmentation under weak supervision is very challenging, because the learning process of each task faces serious ambiguity issues due to the lack of bounding-box or pixel-level supervision. To address this problem, the proposed deep SPFTN model is designed with a novel multi-task self-paced learning objective, which leverages task-specific prior knowledge and the knowledge already captured by the model to infer confident training samples for each task. By aggregating the confident knowledge from each task to mine reliable patterns and learn deep feature representations for both tasks, the proposed framework can address the ambiguity issue under weak supervision with simple optimization. Comprehensive experiments on the large-scale YouTube-Objects and DAVIS datasets demonstrate that the proposed approach achieves superior performance compared with other state-of-the-art methods and the baseline networks/models.
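
To make the idea of inferring confident training samples concrete, the following is a minimal sketch of the classic hard-threshold self-paced learning rule, in which a sample is selected when its current loss falls below an age parameter. It is not the SPFTN objective itself: the function names, the thresholds `lam_loc` and `lam_seg`, and the rule of keeping a sample only when both tasks select it are illustrative assumptions.

```python
import numpy as np


def self_paced_weights(losses, lam):
    """Hard self-paced weights: 1.0 where the loss is below lam, else 0.0."""
    losses = np.asarray(losses, dtype=float)
    return (losses < lam).astype(float)


def joint_confident_mask(loc_losses, seg_losses, lam_loc, lam_seg):
    """Hypothetical aggregation (an assumption, not the paper's rule):
    keep a sample only when both tasks currently find it easy."""
    v_loc = self_paced_weights(loc_losses, lam_loc)
    v_seg = self_paced_weights(seg_losses, lam_seg)
    return v_loc * v_seg


# Per-sample losses from a localization head and a segmentation head (toy values).
loc_losses = [0.1, 0.9, 0.4]
seg_losses = [0.2, 0.1, 0.8]
mask = joint_confident_mask(loc_losses, seg_losses, lam_loc=0.5, lam_seg=0.5)
```

In a self-paced schedule the thresholds would typically be raised over training rounds so that harder samples are admitted gradually; how SPFTN sets and combines them is not specified in this abstract.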
