Discriminative body part interaction mining for mid-level action representation and classification

Abstract

In this paper, we propose a novel mid-level feature representation for the recognition of actions in videos. This descriptor exhibits strong discriminative power when used in a generic action recognition pipeline. It is well known that mid-level feature descriptors learnt using class-oriented information are potentially more distinctive than low-level features extracted in a bottom-up, unsupervised fashion. In this regard, we introduce the notion of concepts, a mid-level feature representation that tracks the dynamics of motion-salient regions over consecutive frames of a video sequence. Our representation is based on the idea of region correspondence across consecutive frames, and we use an unsupervised iterative bipartite graph matching algorithm to extract representative visual concepts from action videos. The progression of such salient regions, which are also consistent in appearance, is then represented as a chain graph. Finally, we adopt an intuitive time-series pooling strategy to extract discriminative features from the chains, which are then used in a dictionary-learning-based classification framework. Given the high variability of the movements of different human body parts across actions, the extracted conceptual descriptors are shown to capture these distinct dynamic characteristics by exclusively encoding the interacting body parts associated with the chains. Furthermore, we use these descriptors in a semi-supervised, clustering-based zero-shot action recognition setting, achieving good performance without resorting to costly attribute annotation. We validate the proposed framework on four public datasets, namely KTH, UCF-101, HOHA and HMDB-51, reporting classification accuracies that improve on (and in some cases are comparable to) the state of the art.
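To make the region-correspondence step concrete, the following is a minimal sketch (not the authors' implementation) of how motion-salient regions could be linked across consecutive frames via optimal bipartite matching and chained over time. The region representation (an appearance vector plus a centroid), the cost function (appearance distance plus weighted spatial distance), and the match-rejection threshold are all illustrative assumptions; the matching itself uses the Hungarian algorithm from SciPy.

```python
# Sketch: chaining motion-salient regions across frames via bipartite matching.
# Assumptions (not from the paper): each region is an (appearance_vector, centroid)
# pair, matching cost = appearance distance + weighted centroid distance, and
# matches whose cost exceeds a threshold are rejected, terminating a chain.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_regions(regions_t, regions_t1, w_spatial=0.5, max_cost=10.0):
    """Return index pairs (i, j) linking regions of frame t to frame t+1."""
    cost = np.zeros((len(regions_t), len(regions_t1)))
    for i, (app_i, ctr_i) in enumerate(regions_t):
        for j, (app_j, ctr_j) in enumerate(regions_t1):
            cost[i, j] = (np.linalg.norm(app_i - app_j)
                          + w_spatial * np.linalg.norm(ctr_i - ctr_j))
    rows, cols = linear_sum_assignment(cost)  # optimal bipartite matching
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]

def build_chains(frames):
    """Link per-frame region lists into chains of (frame_idx, region_idx) pairs."""
    chains = [[(0, i)] for i in range(len(frames[0]))]  # one seed chain per region
    active = {i: c for i, c in enumerate(chains)}       # region idx -> its chain
    for t in range(len(frames) - 1):
        next_active = {}
        for i, j in match_regions(frames[t], frames[t + 1]):
            if i in active:                             # extend an existing chain
                active[i].append((t + 1, j))
                next_active[j] = active[i]
        for j in range(len(frames[t + 1])):             # unmatched regions seed new chains
            if j not in next_active:
                chain = [(t + 1, j)]
                chains.append(chain)
                next_active[j] = chain
        active = next_active
    return chains
```

In this reading, each resulting chain is one candidate "concept": a sequence of appearance-consistent salient regions whose per-frame descriptors can then be summarized by the time-series pooling step before dictionary learning and classification.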
