Supervised framework for automatic recognition and retrieval of interactions: a framework for classifying and retrieving videos with similar human interactions

This study presents the supervised framework for automatic recognition and retrieval of interactions (SAFARRI), a supervised learning framework that recognises interactions such as pushing, punching, and hugging between a pair of human performers in a video shot. The primary contribution of the study is to extend the vector of locally aggregated descriptors (VLAD), a compact and discriminative video encoding, to the complex class-partitioning problem of recognising human interactions. An initial codebook is generated from the training set of video shots by extracting feature descriptors around the spatiotemporal interest points computed across frames. A bag of action words is then built by encoding the first-order statistics of the visual words using VLAD. Support vector machine classifiers (one against all) are trained on these encodings. The authors verify SAFARRI's accuracy for both classification and retrieval (query by example). SAFARRI requires no tracking or recognition of body parts and can identify the region of interaction in video shots. It outperforms recently proposed methods in retrieval and classification on two publicly available human interaction datasets.
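The abstract describes the pipeline only in prose. The following minimal sketch illustrates the two core steps, VLAD encoding over a k-means codebook and one-against-all SVM classification, plus query-by-example retrieval by similarity of the resulting codes. It uses synthetic local descriptors in place of the descriptors the paper computes around spatiotemporal interest points; the codebook size, descriptor dimension, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a VLAD + one-against-all SVM pipeline (assumed details, not the
# paper's code). Random descriptors stand in for the features extracted
# around spatiotemporal interest points in each video shot.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
D, K = 64, 16                       # descriptor dimension, codebook size (illustrative)

def vlad_encode(descriptors, codebook):
    """Aggregate a shot's local descriptors into one K*D VLAD vector:
    per visual word, sum the residuals (descriptor - assigned centroid),
    then apply power- and L2-normalisation."""
    assignments = codebook.predict(descriptors)
    centers = codebook.cluster_centers_
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power normalisation
    return normalize(vlad.reshape(1, -1))[0]       # L2 normalisation

# Synthetic "video shots": each yields a variable number of local descriptors.
shots = [rng.normal(size=(rng.integers(50, 200), D)) for _ in range(40)]
labels = rng.integers(0, 4, size=40)               # four interaction classes

# Codebook from all training descriptors, then one VLAD code per shot.
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(shots))
X = np.array([vlad_encode(s, codebook) for s in shots])

# One-against-all linear SVMs (LinearSVC uses one-vs-rest internally).
clf = LinearSVC().fit(X, labels)
print("training accuracy:", clf.score(X, labels))

# Query by example: since VLAD codes are L2-normalised, the dot product is
# cosine similarity; rank all shots against the query shot's code.
query = X[0]
ranking = np.argsort(-X @ query)
print("top-5 retrieved shot indices:", ranking[:5])
```

The design choice worth noting is that a single fixed-length VLAD vector per shot lets the same representation serve both tasks: the SVMs consume it for classification, and a plain nearest-neighbour ranking over it implements query-by-example retrieval.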
