Learning object class detectors from weakly annotated video

Object detectors are typically trained on a large set of still images annotated by bounding-boxes. This paper introduces an approach for learning object detectors from real-world web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects in a set of videos of the class and learns a detector for it. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos. To compare to the state of the art, we test our detector on still images, i.e., Pascal VOC 2007. We observe that frames extracted from web videos can differ significantly in terms of quality to still images taken by a good camera. Thus, we formulate the learning from videos as a domain adaptation task. We show that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.

[1]  Andrew Zisserman,et al.  Tabula rasa: Model transfer for object category detection , 2011, 2011 International Conference on Computer Vision.

[2]  Matthew B. Blaschko,et al.  Simultaneous Object Detection and Ranking with Weak Supervision , 2010, NIPS.

[3]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[4]  Jong-Nam Kim,et al.  An adaptive Shot Change Detection algorithm using an average of absolute difference histogram within extension sliding window , 2009, 2009 IEEE 13th International Symposium on Consumer Electronics.

[5]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[6]  Thomas Deselaers,et al.  Localizing Objects While Learning Their Appearance , 2010, ECCV.

[7]  François Fleuret,et al.  FlowBoost — Appearance learning from sparsely annotated video , 2011, CVPR 2011.

[8]  Andrew Zisserman,et al.  An Exemplar Model for Learning Object Classes , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[10]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[11]  Binoy Pinto,et al.  Speeded Up Robust Features , 2011 .

[12]  David A. Forsyth,et al.  Building models of animals from video , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Horst Bischof,et al.  On-line Boosting and Vision , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[14]  Jiri Matas,et al.  P-N learning: Bootstrapping binary classifiers by structural constraints , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Nebojsa Jojic,et al.  LOCUS: learning object classes with unsupervised segmentation , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[16]  Thomas Deselaers,et al.  What is an object? , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[17]  Pietro Perona,et al.  A walk through the web’s video clips , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[18]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[19]  Subhransu Maji,et al.  Classification using intersection kernel support vector machines is efficient , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Andrew Zisserman,et al.  Representing shape with a spatial pyramid kernel , 2007, CIVR '07.

[21]  David A. Forsyth,et al.  Unsupervised Segmentation of Objects using Efficient Learning , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[23]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Yong Jae Lee,et al.  Learning the easy things first: Self-paced visual category discovery , 2011, CVPR 2011.

[25]  Svetlana Lazebnik,et al.  Scene recognition and weakly supervised object localization with deformable part-based models , 2011, 2011 International Conference on Computer Vision.

[26]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[27]  Kristen Grauman,et al.  Large-scale live active learning: Training object detectors with crawled data and crowds , 2011, CVPR.

[28]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[29]  Andrew Zisserman,et al.  Multiple kernels for object detection , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[30]  Hal Daumé,et al.  Frustratingly Easy Domain Adaptation , 2007, ACL.

[31]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[32]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[33]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[34]  Horst Bischof,et al.  Improving classifiers with unlabeled weakly-related videos , 2011, CVPR 2011.

[35]  Daniel P. Huttenlocher,et al.  Weakly Supervised Learning of Part-Based Spatial Models for Visual Object Recognition , 2006, ECCV.

[36]  Joachim M. Buhmann,et al.  Seeing the Objects Behind the Dots: Recognition in Videos from a Moving Camera , 2009, International Journal of Computer Vision.

[37]  Vladimir Kolmogorov,et al.  Convergent Tree-Reweighted Message Passing for Energy Minimization , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Barbara Caputo,et al.  Safety in numbers: Learning categories from few examples with multi model knowledge transfer , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[39]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[40]  Cordelia Schmid,et al.  Combining efficient object localization and image classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.