Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths.

[1]  Michael Dorr,et al.  Space-Variant Descriptor Sampling for Action Recognition Based on Saliency and Eye Movements , 2012, ECCV.

[2]  Ying Wu,et al.  Discriminative Video Pattern Search for Efficient Action Detection , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Ming-Hsuan Yang,et al.  Top-down visual saliency via joint CRF and dictionary learning , 2012, CVPR.

[4]  Ying Wu,et al.  Discriminative subvolume search for efficient action detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[6]  Junsong Yuan,et al.  Max-Margin Structured Output Regression for Spatio-Temporal Action Localization , 2012, NIPS.

[7]  Kristen Grauman,et al.  Efficient activity detection with max-subgraph search , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[9]  Cristian Sminchisescu,et al.  Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition , 2012, ECCV.

[10]  Thorsten Joachims,et al.  Learning structural SVMs with latent variables , 2009, ICML '09.

[11]  Greg Mori,et al.  Similarity Constrained Latent Support Vector Machine: An Application to Weakly Supervised Action Classification , 2012, ECCV.

[12]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[13]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Yang Wang,et al.  Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  James M. Rehg,et al.  Learning to Recognize Daily Actions Using Gaze , 2012, ECCV.

[16]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[17]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[18]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Junsong Yuan,et al.  Optimal spatio-temporal path discovery for video event detection , 2011, CVPR 2011.

[20]  Leonid Sigal,et al.  Poselet Key-Framing: A Model for Human Activity Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Christoph H. Lampert,et al.  Learning to Localize Objects with Structured Output Regression , 2008, ECCV.

[22]  B. Ripley,et al.  Pattern Recognition , 1968, Nature.

[23]  Tao Xiang,et al.  Weakly Supervised Action Detection , 2011, BMVC.

[24]  Bruno A. Olshausen,et al.  Pattern recognition, attention, and information bottlenecks in the primate visual system , 1991, Defense, Security, and Sensing.

[25]  Rama Chellappa,et al.  Machine Recognition of Human Activities: A Survey , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[26]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[28]  Richard P. Wildes,et al.  Efficient action spotting based on a spacetime oriented structure representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[29]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.