Track to the future: Spatio-temporal video segmentation with long-range motion cues

Video provides not only rich visual cues such as motion and appearance, but also much less explored long-range temporal interactions among objects. We aim to capture such interactions and to construct a powerful intermediate-level video representation for subsequent recognition. Motivated by this goal, we seek to obtain spatio-temporal oversegmentation of a video into regions that respect object boundaries and, at the same time, associate object pixels over many video frames. The contributions of this paper are two-fold. First, we develop an efficient spatiotemporal video segmentation algorithm, which naturally incorporates long-range motion cues from the past and future frames in the form of clusters of point tracks with coherent motion. Second, we devise a new track clustering cost function that includes occlusion reasoning, in the form of depth ordering constraints, as well as motion similarity along the tracks. We evaluate the proposed approach on a challenging set of video sequences of office scenes from feature length movies.

[1]  Edward H. Adelson,et al.  Representing moving images with layers , 1994, IEEE Trans. Image Process..

[2]  Yair Weiss,et al.  Smoothness in layers: Motion segmentation using nonparametric mixture estimation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[3]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[4]  Jitendra Malik,et al.  Motion segmentation and tracking using normalized cuts , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[5]  David J. Fleet,et al.  Probabilistic detection and tracking of motion discontinuities , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[6]  Richard Szeliski,et al.  An integrated Bayesian approach to layer extraction from image sequences , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[7]  Marie-Pierre Jolly,et al.  Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[8]  Marie-Pierre Jolly,et al.  Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images , 2001, ICCV.

[9]  Daniel DeMenthon,et al.  SPATIO-TEMPORAL SEGMENTATION OF VIDEO BY HIERARCHICAL MEAN SHIFT ANALYSIS , 2002 .

[10]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Endre Boros,et al.  Pseudo-Boolean optimization , 2002, Discret. Appl. Math..

[12]  David J. Fleet,et al.  Probabilistic Detection and Tracking of Motion Boundaries , 2000, International Journal of Computer Vision.

[13]  Vladimir Kolmogorov,et al.  What energy functions can be minimized via graph cuts? , 2002, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Paul Smith,et al.  Layered motion segmentation and depth ordering by tracking edges , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Bo Thiesson,et al.  Image and Video Segmentation by Anisotropic Kernel Mean Shift , 2004, ECCV.

[16]  Robert C. Bolles,et al.  Epipolar-plane image analysis: An approach to determining structure from motion , 1987, International Journal of Computer Vision.

[17]  Patrick Pérez,et al.  Interactive Image Segmentation Using an Adaptive GMMRF Model , 2004, ECCV.

[18]  Takeo Kanade,et al.  A Multibody Factorization Method for Independently Moving Objects , 1998, International Journal of Computer Vision.

[19]  Andrew W. Fitzgibbon,et al.  Learning spatiotemporal T-junctions for occlusion detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[20]  Mubarak Shah,et al.  Motion layer extraction in the presence of occlusion using graph cuts , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Andrew Zisserman,et al.  Object Level Grouping for Video Shots , 2004, International Journal of Computer Vision.

[22]  Andrew Zisserman,et al.  Learning Layered Motion Segmentations of Video , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[23]  Nebojsa Jojic,et al.  Consistent segmentation for optical flow estimation , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[24]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[25]  Marc Pollefeys,et al.  A General Framework for Motion Segmentation: Independent, Articulated, Rigid, Non-rigid, Degenerate and Non-degenerate , 2006, ECCV.

[26]  Vladimir Kolmogorov,et al.  Convergent Tree-Reweighted Message Passing for Energy Minimization , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Seth J. Teller,et al.  Particle Video: Long-Range Motion Estimation Using Point Trajectories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[28]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[29]  Antonio Criminisi,et al.  TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context , 2007, International Journal of Computer Vision.

[30]  Martial Hebert,et al.  Learning to Find Object Boundaries Using Motion Cues , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[31]  Nikos Komodakis,et al.  Fast, Approximately Optimal Solutions for Single and Dynamic MRFs , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  David Salesin,et al.  Video object annotation, navigation, and composition , 2008, UIST '08.

[33]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Guillermo Sapiro,et al.  Video SnapCut: robust video object cutout using localized classifiers , 2009, SIGGRAPH 2009.

[35]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[36]  William Brendel,et al.  Video object segmentation by tracking regions , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[37]  Philip H. S. Torr,et al.  What, Where and How Many? Combining Object Detectors and CRFs , 2010, ECCV.

[38]  Philip H. S. Torr,et al.  What, Where & How Many? Combining Object Detectors and CRFs , 2010 .

[39]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[40]  Eric L. Miller,et al.  Multiple Hypothesis Video Segmentation from Superpixel Flows , 2010, ECCV.

[41]  Anton Osokin,et al.  Fast Approximate Energy Minimization with Label Costs , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[42]  Mei Han,et al.  Efficient hierarchical graph-based video segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.