High Five: Recognising human interactions in TV shows

In this paper we address the problem of recognising interactions between two people in realistic scenarios for video retrieval purposes. We develop a per-person descriptor that uses attention (head orientation) and the local spatial and temporal context in a neighbourhood of each detected person. Using head orientation mitigates camera-view ambiguities, while the local context, composed of histograms of gradients and motion, aims to capture cues such as hand and arm movement. We also employ structured learning to capture spatial relationships between interacting individuals. Using this descriptor, we train an initial set of one-versus-rest linear SVM classifiers, one per interaction. Noting that people generally face each other while interacting, we then learn a structured SVM that combines head orientation with the relative location of people in a frame to improve upon the initial classification obtained with our descriptor. To test the efficacy of our method, we have created a new dataset of realistic human interactions, composed of clips extracted from TV shows, which poses a challenging benchmark. Our experiments show that structured learning improves retrieval results compared to using the interaction classifiers independently.
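The combination described above can be illustrated with a minimal sketch: independent one-versus-rest linear scores per person, plus a pairwise term that rewards hypotheses in which the two people's heads are oriented toward each other. All names, dimensions, and the exact form of the pairwise term below are illustrative assumptions, not the paper's actual model; the learned structured-SVM weights are replaced by random stand-ins.

```python
import numpy as np

# Hypothetical sketch (not the paper's implementation): one-vs-rest
# linear scoring over per-person descriptors, re-scored with a
# "facing each other" term derived from head orientation.

rng = np.random.default_rng(0)

N_CLASSES = 4   # e.g. handshake, high five, hug, kiss
DESC_DIM = 16   # toy descriptor dimensionality (assumption)

# Stand-ins for learned one-vs-rest SVM weights and biases.
W = rng.normal(size=(N_CLASSES, DESC_DIM))
b = np.zeros(N_CLASSES)

def ovr_scores(descriptor):
    """Independent one-vs-rest linear scores for one person's descriptor."""
    return W @ descriptor + b

def structured_score(scores_a, scores_b, head_a, head_b, pos_a, pos_b):
    """Combine per-person scores with a pairwise facing term.

    head_a, head_b: unit 2-D head-orientation vectors.
    pos_a, pos_b:   image positions of the two people.
    The pairwise term is largest when each head points at the other
    person, smallest when both look away (range [-1, 1]).
    """
    to_b = pos_b - pos_a
    to_b = to_b / np.linalg.norm(to_b)
    facing = 0.5 * (head_a @ to_b + head_b @ (-to_b))
    return scores_a + scores_b + facing  # scalar term broadcast over classes
```

In this toy form the facing term shifts all class scores equally; the paper's structured SVM instead learns how head orientation and relative location interact with each class label, so the re-scoring is class-dependent.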
