Paper information - Vision, Perception and Multimedia Understanding: Weakly Supervised Learning of Interactions between Humans and Objects

We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e. the spatial relation between the human and the object. We compare experimentally to [1] and [2] on the action classification dataset from [1] and also present results on a new human-object interaction dataset.

Figure 1: Example results of our approach showing the automatically detected human (green) and the automatically detected object (pink).
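To make the idea of a probabilistic model of the human-object spatial relation more concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract does not say which spatial features or which distribution are used. The three features (object offset and log area ratio relative to the human box), the diagonal Gaussian, and all function names below are assumptions chosen purely for illustration.

```python
import math
from statistics import mean, pvariance

# Boxes are (x_min, y_min, x_max, y_max) in pixels.
# The feature choice and the Gaussian form are illustrative assumptions,
# not taken from the paper.

def relative_features(human, obj):
    """Describe the object box relative to the human box."""
    hx, hy = (human[0] + human[2]) / 2, (human[1] + human[3]) / 2
    ox, oy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    h_w, h_h = human[2] - human[0], human[3] - human[1]
    o_w, o_h = obj[2] - obj[0], obj[3] - obj[1]
    dx = (ox - hx) / h_h  # horizontal offset in units of human height
    dy = (oy - hy) / h_h  # vertical offset in units of human height
    log_scale = math.log((o_w * o_h) / (h_w * h_h))  # log area ratio
    return (dx, dy, log_scale)

def fit_diagonal_gaussian(samples, min_var=1e-3):
    """Fit an independent Gaussian per feature; clamp tiny variances."""
    columns = list(zip(*samples))
    means = [mean(c) for c in columns]
    variances = [max(pvariance(c), min_var) for c in columns]
    return means, variances

def log_likelihood(features, model):
    """Log-density of one feature vector under the fitted model."""
    means, variances = model
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (f - m) ** 2 / (2 * v)
        for f, m, v in zip(features, means, variances)
    )

# Hypothetical (human, object) box pairs from images of one action class.
pairs = [
    ((100, 50, 200, 350), (190, 150, 250, 210)),
    ((300, 40, 420, 400), (405, 160, 480, 235)),
    ((50, 60, 140, 330), (130, 150, 185, 205)),
]
model = fit_diagonal_gaussian([relative_features(h, o) for h, o in pairs])

# An object placed like the training examples scores higher than a distant one.
near = relative_features((0, 0, 100, 300), (90, 100, 150, 160))
far = relative_features((0, 0, 100, 300), (400, 0, 420, 20))
print(log_likelihood(near, model) > log_likelihood(far, model))
```

Normalizing offsets by the human height makes the features independent of image scale, which is one plausible reason to define the relation in a human-centric frame. A model of this kind could then rank candidate object windows by how well their placement matches the learned relation.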

C. Schmid | V. Ferrari | Alessandro Prest

[1] Thomas Deselaers, et al. Localizing Objects While Learning Their Appearance, 2010, ECCV.

[2] David A. McAllester, et al. Object Detection with Discriminatively Trained Part Based Models, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3] Fei-Fei Li, et al. Modeling mutual context of object and human pose in human-object interaction activities, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[4] Thomas Deselaers, et al. What is an object?, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5] Fei-Fei Li, et al. Grouplet: A structured image representation for recognizing human and object interactions, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6] Larry S. Davis, et al. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7] Nazli Ikizler-Cinbis, et al. Learning actions from the Web, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[8] Sebastian Nowozin, et al. On feature combination for multiclass object classification, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[9] Charless C. Fowlkes, et al. Discriminative Models for Multi-Class Object Layout, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[10] Richard Johansson, et al. Dependency-based Syntactic–Semantic Analysis with PropBank and NomBank, 2008, CoNLL.

[11] Andrew Zisserman, et al. Progressive search space reduction for human pose estimation, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12] Václav Hlavác, et al. Pose primitive based human action recognition in videos or still images, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[13] Cordelia Schmid, et al. Learning realistic human actions from movies, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14] Krystian Mikolajczyk, et al. Action recognition with motion-appearance vocabulary forest, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[15] Patrick Pérez, et al. Retrieving actions in movies, 2007, 2007 IEEE 11th International Conference on Computer Vision.

[16] Vladimir Kolmogorov, et al. Convergent Tree-Reweighted Message Passing for Energy Minimization, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17] Cordelia Schmid, et al. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[18] Sébastien Marcel, et al. Local binary patterns as an image preprocessing for face authentication, 2006, 7th International Conference on Automatic Face and Gesture Recognition (FGR06).

[19] Antonio Criminisi, et al. Object categorization by learned universal visual dictionary, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[20] Serge J. Belongie, et al. Behavior recognition via sparse spatio-temporal features, 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[21] B. Caputo, et al. Recognizing human actions: a local SVM approach, 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.

[22] Pietro Perona, et al. Object class recognition by unsupervised scale-invariant learning, 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings.

[23] Stefan Carlsson, et al. Recognizing and Tracking Human Action, 2002, ECCV.

[24] Paul A. Viola, et al. Rapid object detection using a boosted cascade of simple features, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[25] D. Comaniciu, et al. The variable bandwidth mean shift and data-driven scale selection, 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[26] Antonio Torralba, et al. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope, 2001, International Journal of Computer Vision.

[27] Vittorio Ferrari, et al. Better Appearance Models for Pictorial Structures, 2009, BMVC.

[28] Luc Van Gool, et al. Exemplar-based Action Recognition in Video, 2009, BMVC.

[29] Christopher Hunt. SURF: Speeded-Up Robust Features, 2009.

[30] Yann Rodriguez, et al. Face detection and verification using local binary patterns, 2006.

[31] Paul Clough, et al. The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems, 2006.

[32] Jianguo Zhang, et al. The PASCAL Visual Object Classes Challenge, 2006.