Discriminative Segment Annotation in Weakly Labeled Video

The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in their computational characteristics and in their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept and occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
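The core idea, ranking segments from positive videos by how strongly nearby negative segments penalize them, can be sketched as follows. This is a simplified illustration under assumed details (the nearest-neighbor penalty form and the toy 2-D features are ours, not the paper's exact formulation):

```python
# Hedged sketch of a CRANE-style negative-mining scorer.
# Segments are feature vectors: "positives" come from weakly tagged videos,
# "negatives" from videos lacking the tag. Each negative instance penalizes
# its nearest positive segment, so positives that resemble background rank low.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def crane_scores(positives, negatives):
    """Return one score per positive segment; higher = more concept-like.

    Each negative 'cuts' its nearest positive: that positive's score drops
    by a penalty shrinking with distance (an assumed penalty form). Because
    every negative contributes independently, the outer loop parallelizes
    trivially, matching the abstract's scalability claim.
    """
    scores = [0.0] * len(positives)
    for z in negatives:
        dists = [euclidean(p, z) for p in positives]
        nearest = min(range(len(positives)), key=dists.__getitem__)
        scores[nearest] -= 1.0 / (1.0 + dists[nearest])
    return scores

# Toy example: two concept-like segments far from all negatives, and one
# background-like segment sitting in the middle of the negative cluster.
positives = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]]
negatives = [[5.1, 5.0], [4.9, 5.1], [5.0, 4.9]]
scores = crane_scores(positives, negatives)
```

Here the background-like segment at (5, 5) absorbs every negative's penalty and ranks last, while the two concept-like segments keep a score of zero, which is the qualitative behavior the abstract describes.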
