Vision, Perception and Multimedia Understanding Weakly Supervised Learning of Interactions between Humans and Objects Weakly Supervised Learning of Interactions between Humans and Objects

We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e. the spatial relation between the human and the object. We compare experimentally to [1] and [2] on the action classification dataset from [1] and also present results on a new human-object interaction dataset. Weakly supervised learning of interactions between humans and objects 3 Figure 1: Example results of our approach showing the automatically detected human (green) and the automatically detected object (pink).

[1]  Thomas Deselaers,et al.  Localizing Objects While Learning Their Appearance , 2010, ECCV.

[2]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[4]  Thomas Deselaers,et al.  What is an object? , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Fei-Fei Li,et al.  Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Nazli Ikizler-Cinbis,et al.  Learning actions from the Web , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[8]  Sebastian Nowozin,et al.  On feature combination for multiclass object classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[9]  Charless C. Fowlkes,et al.  Discriminative Models for Multi-Class Object Layout , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[10]  Richard Johansson,et al.  Dependency-based Syntactic–Semantic Analysis with PropBank and NomBank , 2008, CoNLL.

[11]  Andrew Zisserman,et al.  Progressive search space reduction for human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Václav Hlavác,et al.  Pose primitive based human action recognition in videos or still images , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Krystian Mikolajczyk,et al.  Action recognition with motion-appearance vocabulary forest , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[16]  Vladimir Kolmogorov,et al.  Convergent Tree-Reweighted Message Passing for Energy Minimization , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[18]  Sébastien Marcel,et al.  Local binary patterns as an image preprocessing for face authentication , 2006, 7th International Conference on Automatic Face and Gesture Recognition (FGR06).

[19]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[20]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[21]  B. Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[22]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[23]  Stefan Carlsson,et al.  Recognizing and Tracking Human Action , 2002, ECCV.

[24]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[25]  D. Comaniciu,et al.  The variable bandwidth mean shift and data-driven scale selection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[26]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[27]  Vittorio Ferrari,et al.  Better Appearance Models for Pictorial Structures , 2009, BMVC.

[28]  Luc Van Gool,et al.  Exemplar-based Action Recognition in Video , 2009, BMVC.

[29]  Christopher Hunt SURF: Speeded-Up Robust Features , 2009 .

[30]  Yann Rodriguez,et al.  Face detection and verification using local binary patterns , 2006 .

[31]  Paul Clough,et al.  The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems , 2006 .

[32]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .