Interaction part mining: A mid-level approach for fine-grained action recognition

Modeling human-object interactions and manipulating motions lies in the heart of fine-grained action recognition. Previous methods heavily rely on explicit detection of the object being interacted, which requires intensive human labour on object annotation. To bypass this constraint and achieve better classification performance, in this work, we propose a novel fine-grained action recognition pipeline by interaction part proposal and discriminative mid-level part mining. Firstly, we generate a large number of candidate object regions using off-the-shelf object proposal tool, e.g., BING. Secondly, these object regions are matched and tracked across frames to form a large spatio-temporal graph based on the appearance matching and the dense motion trajectories through them. We then propose an efficient approximate graph segmentation algorithm to partition and filter the graph into consistent local dense sub-graphs. These sub-graphs, which are spatio-temporal sub-volumes, represent our candidate interaction parts. Finally, we mine discriminative mid-level part detectors from the features computed over the candidate interaction parts. Bag-of-detection scores based on a novel Max-N pooling scheme are computed as the action representation for a video sample. We conduct extensive experiments on human-object interaction datasets including MPII Cooking and MSR Daily Activity 3D. The experimental results demonstrate that the proposed framework achieves consistent improvements over the state-of-the-art action recognition accuracies on the benchmarks, without using any object annotation.

[1]  Irfan A. Essa,et al.  Exploiting human actions and object context for recognition tasks , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[2]  Zoran Zivkovic,et al.  Improved adaptive Gaussian mixture model for background subtraction , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[3]  Konstantin Andreev,et al.  Balanced Graph Partitioning , 2004, SPAA '04.

[4]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[5]  Henry A. Kautz,et al.  Fine-grained activity recognition by aggregating abstract object usage , 2005, Ninth IEEE International Symposium on Wearable Computers (ISWC'05).

[6]  Meinard Müller,et al.  Motion templates for automatic classification and retrieval of motion capture data , 2006, SCA '06.

[7]  James M. Rehg,et al.  A Scalable Approach to Activity Recognition based on Object Use , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[8]  Larry S. Davis,et al.  Objects in Action: An Approach for Combining Action Understanding and Object Perception , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Danica Kragic,et al.  Simultaneous Visual Recognition of Manipulation Actions and Manipulated Objects , 2008, ECCV.

[11]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[12]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[13]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[14]  Yang Wang,et al.  Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[16]  Li Fei-Fei,et al.  Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses , 2011 .

[17]  Alexei A. Efros,et al.  Ensemble of exemplar-SVMs for object detection and beyond , 2011, 2011 International Conference on Computer Vision.

[18]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Dieter Fox,et al.  Fine-grained kitchen activity recognition using RGB-D , 2012, UbiComp.

[20]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Kate Saenko,et al.  A combined pose, object, and feature model for action understanding , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[24]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[25]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Cordelia Schmid,et al.  Explicit Modeling of Human-Object Interactions in Realistic Videos , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Cristian Sminchisescu,et al.  The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[29]  Yiannis Aloimonos,et al.  Detection of Manipulation Action Consequences (MAC) , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Juan Carlos Niebles,et al.  Spatio-temporal Human-Object Interactions for Action Recognition in Videos , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[32]  Limin Wang,et al.  Mining Motion Atoms and Phrases for Complex Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.

[34]  Hema Swetha Koppula,et al.  Learning human activities and object affordances from RGB-D videos , 2012, Int. J. Robotics Res..

[35]  Larry S. Davis,et al.  Representing Videos Using Mid-level Discriminative Patches , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[37]  Philip H. S. Torr,et al.  BING: Binarized normed gradients for objectness estimation at 300fps , 2014, Computational Visual Media.

[38]  Cewu Lu,et al.  Range-Sample Depth Feature for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Bingbing Ni,et al.  Pipelining Localized Semantic Features for Fine-Grained Action Recognition , 2014, ECCV.