Learning bag of spatio-temporal features for human interaction recognition

The bag-of-visual-words (BoVW) model has achieved impressive performance on human activity recognition. However, because the method ignores the spatiotemporal distribution of visual words, it struggles to capture the high-level semantics behind video features and cannot localize interactions within a video. In this paper, we propose a supervised learning framework that automatically recognizes high-level human interactions from a bag of spatiotemporal visual features. First, a representative keyframe that captures the major body parts of the interacting persons is selected, and bounding boxes around the persons are extracted to parse each person's pose. In this keyframe, features are detected for each interacting person by combining edge features with Maximally Stable Extremal Regions (MSER), and are then tracked forward and backward over the entire video sequence. From these feature tracks, 3D (XYT) spatiotemporal volumes are generated for each interacting target. The k-means algorithm then builds a codebook of visual features, and the interaction is represented by the combined occurrence-frequency histogram of visual words across the interacting persons. Extensive experimental evaluation on the UT-Interaction dataset demonstrates that our method recognizes ongoing interactions in video with a simple implementation.
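A minimal Python/OpenCV sketch of this pipeline is given below. It is illustrative only, not the authors' implementation: the function names, the Canny thresholds (100/200), the 1-pixel forward-backward error limit, and the 64-word codebook are all assumptions. The tracking-failure test follows the forward-backward error of Kalal et al.

```python
# Illustrative sketch of the keyframe -> track -> codebook pipeline.
# All thresholds and names are assumptions, not the paper's exact settings.
import cv2
import numpy as np

def detect_keyframe_features(gray, max_points=200):
    """Combine MSER region centres with corner points lying on Canny edges."""
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    centres = [r.mean(axis=0) for r in regions]        # one point per stable region

    edges = cv2.Canny(gray, 100, 200)                  # assumed thresholds
    corners = cv2.goodFeaturesToTrack(gray, max_points, 0.01, 5,
                                      mask=(edges > 0).astype(np.uint8))
    pts = list(corners.reshape(-1, 2)) if corners is not None else []
    return np.float32(centres + pts).reshape(-1, 1, 2)

def track_forward_backward(frames, pts, max_fb_error=1.0):
    """Track points through grayscale `frames`; drop tracks whose
    forward-backward error exceeds `max_fb_error` pixels."""
    tracks = [pts]
    for prev, nxt in zip(frames, frames[1:]):
        fwd, st, _ = cv2.calcOpticalFlowPyrLK(prev, nxt, tracks[-1], None)
        bwd, _, _ = cv2.calcOpticalFlowPyrLK(nxt, prev, fwd, None)
        fb_err = np.linalg.norm((tracks[-1] - bwd).reshape(-1, 2), axis=1)
        ok = (st.ravel() == 1) & (fb_err < max_fb_error)
        tracks = [t[ok] for t in tracks] + [fwd[ok]]
        if not ok.any():                               # all tracks failed; stop early
            break
    return np.stack(tracks)                            # (T, N, 1, 2): XYT track volume

def bovw_histogram(descriptors, k=64):
    """Quantize track descriptors with k-means and return the word histogram."""
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, _ = cv2.kmeans(np.float32(descriptors), k, None,
                              criteria, 5, cv2.KMEANS_RANDOM_CENTERS)
    hist, _ = np.histogram(labels, bins=np.arange(k + 1))
    return hist / max(hist.sum(), 1)                   # normalized BoVW vector
```

In this reading of the pipeline, one such histogram would be computed per interacting person and the per-person histograms summed into a single interaction descriptor before classification.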
