Video Object Segmentation by Salient Segment Chain Composition

We present a model for video segmentation, applicable to RGB (and, if available, RGB-D) information, that constructs multiple plausible partitions corresponding to the static and the moving objects in the scene: i) we generate multiple figure-ground segmentations in each frame, parametrically, based on boundary and optical flow cues, then track, link and refine the salient segment chains corresponding to the different objects over time, using long-range temporal constraints; ii) we obtain a video partition by composing segment chains into consistent tilings, where the different individual object chains explain the video and do not overlap. Saliency metrics based on figural and motion cues, as well as measures learned from human eye movements, are exploited, with substantial gain, at the level of segment generation and chain construction, in order to produce compact sets of hypotheses that correctly reflect the qualities of the different configurations. The model makes it possible to compute multiple hypotheses both for individual object segmentations tracked over time and for complete video partitions. We report quantitative, state-of-the-art results on the SegTrack single-object benchmark, and promising qualitative and quantitative results on clips of multiple static and moving objects collected from Hollywood movies and from the MIT dataset.
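
The composition step in ii) can be illustrated with a minimal sketch. It is not the model's actual inference procedure, which the abstract does not specify: it substitutes a simple greedy selection that keeps the most salient chains whose space-time supports do not overlap. The `SegmentChain` container, the scalar `saliency` score, the `box` helper and the toy chains are all hypothetical names and data introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentChain:
    """Hypothetical container: one object hypothesis tracked over time."""
    name: str
    saliency: float      # assumed aggregate saliency score for the whole chain
    voxels: frozenset    # (frame, y, x) positions covered by the chain


def overlaps(a, b):
    """True if two chains share at least one space-time position."""
    return not a.voxels.isdisjoint(b.voxels)


def compose_tiling(chains):
    """Greedy stand-in for composition: visit chains from most to least
    salient and keep each one that overlaps none of those already kept."""
    selected = []
    for chain in sorted(chains, key=lambda c: c.saliency, reverse=True):
        if all(not overlaps(chain, kept) for kept in selected):
            selected.append(chain)
    return selected


def box(frames, y0, y1, x0, x1):
    """Toy helper: a rectangular region repeated over the given frames."""
    return frozenset(
        (t, y, x) for t in frames for y in range(y0, y1) for x in range(x0, x1)
    )


# Toy data: two competing hypotheses for one object, plus a separate object.
chains = [
    SegmentChain("person", 0.9, box(range(3), 0, 4, 0, 4)),
    SegmentChain("person_loose", 0.6, box(range(3), 0, 5, 0, 5)),
    SegmentChain("car", 0.7, box(range(3), 6, 9, 6, 9)),
]

tiling = compose_tiling(chains)
names = [c.name for c in tiling]
```

Here the looser "person_loose" hypothesis is discarded because it overlaps the more salient "person" chain, while "car" is kept. A greedy pass returns a single tiling; producing several alternative tilings, as the abstract describes, would require enumerating multiple consistent subsets rather than stopping at the first one.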
