Unsupervised Object Discovery and Tracking in Video Collections

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision. We formulate the problem as a combination of two complementary processes: discovery and tracking. The first one establishes correspondences between prominent regions across videos, and the second one associates similar object regions within the same video. Interestingly, our algorithm also discovers the implicit topology of frames associated with instances of the same object class across different videos, a role normally left to supervisory information in the form of class labels in conventional image and video understanding methods. Indeed, as demonstrated by our experiments, our method can handle video collections featuring multiple object classes, and substantially outperforms the state of the art in colocalization, even though it tackles a broader problem with much less supervision.

[1]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[2]  Leonidas J. Guibas,et al.  Image Co-segmentation via Consistent Functional Maps , 2013, 2013 IEEE International Conference on Computer Vision.

[3]  Alexei A. Efros,et al.  Discovering object categories in image collections , 2005 .

[4]  Antonio Torralba,et al.  Unsupervised Detection of Regions of Interest Using Iterative Link Analysis , 2009, NIPS.

[5]  Luc Van Gool,et al.  Online Multiperson Tracking-by-Detection from a Single, Uncalibrated Camera , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Santiago Manen,et al.  Prime Object Proposals with Randomized Prim's Algorithm , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Antonio Torralba,et al.  Small codes and large image databases for recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Tao Xiang,et al.  Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation , 2013, 2013 IEEE International Conference on Computer Vision.

[9]  Michal Irani,et al.  “Clustering by Composition”—Unsupervised Discovery of Image Categories , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[11]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[12]  Bohyung Han,et al.  Generalized background subtraction based on hybrid inference by belief propagation and Bayesian filtering , 2011, 2011 International Conference on Computer Vision.

[13]  Ian D. Reid,et al.  Stable multi-target tracking in real-time surveillance video , 2011, CVPR 2011.

[14]  Fei-Fei Li,et al.  Discriminative Segment Annotation in Weakly Labeled Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[16]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[17]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[18]  Tinne Tuytelaars,et al.  Video object proposals , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[19]  Thomas Brox,et al.  Higher order motion models and spectral clustering , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Cordelia Schmid,et al.  Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Xiaoqing Ding,et al.  Detecting Human Action as the Spatio-Temporal Tube of Maximum Mutual Information , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[22]  Vibhav Vineet,et al.  Struck: Structured Output Tracking with Kernels , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Seunghoon Hong,et al.  Orderless Tracking through Model-Averaged Posterior Estimation , 2013, 2013 IEEE International Conference on Computer Vision.

[24]  Nanning Zheng,et al.  Video Object Discovery and Co-Segmentation with Extremely Weak Supervision , 2017, IEEE Trans. Pattern Anal. Mach. Intell..

[25]  Cordelia Schmid,et al.  Multi-fold MIL Training for Weakly Supervised Object Localization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Gang Yu,et al.  Propagative Hough Voting for Human Activity Recognition , 2012, ECCV.

[27]  Cordelia Schmid,et al.  Learning object class detectors from weakly annotated video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Thomas Brox,et al.  Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions , 2011, 2011 International Conference on Computer Vision.

[29]  Trevor Darrell,et al.  Unsupervised Learning of Categories from Sets of Partially Matching Image Features , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[30]  Yong Jae Lee,et al.  Key-segments for video object segmentation , 2011, 2011 International Conference on Computer Vision.

[31]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[32]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Thomas Deselaers,et al.  Localizing Objects While Learning Their Appearance , 2010, ECCV.

[34]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Jitendra Malik,et al.  Discriminative Decorrelation for Clustering and Classification , 2012, ECCV.

[36]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[37]  Ramakant Nevatia,et al.  Video segmentation with spatio-temporal tubes , 2013, 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[38]  Yi Wu,et al.  Online Object Tracking: A Benchmark , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Fei-Fei Li,et al.  Efficient Image and Video Co-localization with Frank-Wolfe Algorithm , 2014, ECCV.

[40]  Charless C. Fowlkes,et al.  Globally-optimal greedy algorithms for tracking a variable number of objects , 2011, CVPR 2011.

[41]  Fei-Fei Li,et al.  Co-localization in Real-World Images , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Leonidas J. Guibas,et al.  Unsupervised Multi-class Joint Image Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.