Understanding human activities in videos: A joint action and interaction learning approach

Abstract In video surveillance with multiple people, human interactions and action categories are strongly correlated, and identifying the interaction configuration is crucial to the success of action recognition. Interactions are typically estimated using heuristics or treated as latent variables. However, the former often yields incorrect interaction configurations, while the latter amounts to solving challenging optimization problems. Here we address these problems systematically by proposing a novel structured learning framework that enables the joint prediction of actions and interactions. To this end, both features learned via deep nets and human interaction context are leveraged to encode the correlations among actions and pairwise interactions in a structured model, and all model parameters are trained via a large-margin framework. To solve the associated inference problem, we present two optimization algorithms: alternating search and belief propagation. Experiments on both synthetic and real datasets demonstrate the strength of the proposed approach.
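To make the idea of alternating search concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes a hypothetical score made of per-person action scores (`unary`), a symmetric action-compatibility table (`pair`), and a per-pair interaction score (`inter`); these names, the score itself, and the update order are all illustrative assumptions. The sketch alternates between choosing interaction edges with actions fixed and updating actions with edges fixed, so the assumed score never decreases, although the result may only be a local optimum.

```python
import numpy as np

def score(unary, pair, inter, actions, edges):
    """Assumed joint score: unary terms plus pairwise terms on active edges."""
    n = len(actions)
    total = sum(unary[i, actions[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if edges[i, j]:
                total += pair[actions[i], actions[j]] + inter[i, j]
    return float(total)

def alternating_search(unary, pair, inter, max_iters=20):
    """Alternate between edge and action updates (hypothetical sketch).

    unary: (n, k) action scores per person.
    pair:  (k, k) symmetric action-compatibility scores.
    inter: (n, n) symmetric scores for person i and j interacting.
    """
    unary = np.asarray(unary, dtype=float)
    pair = np.asarray(pair, dtype=float)
    inter = np.asarray(inter, dtype=float)
    n, _ = unary.shape
    actions = unary.argmax(axis=1)          # start from the unary-only guess
    edges = np.zeros((n, n), dtype=bool)    # start with no interactions
    for _ in range(max_iters):
        old_actions, old_edges = actions.copy(), edges.copy()
        # Step 1: actions fixed; switch an edge on only if it adds to the score.
        for i in range(n):
            for j in range(i + 1, n):
                on = pair[actions[i], actions[j]] + inter[i, j] > 0
                edges[i, j] = edges[j, i] = on
        # Step 2: edges fixed; update one person's action at a time.
        for i in range(n):
            gains = unary[i].copy()
            for j in range(n):
                if j != i and edges[i, j]:
                    gains += pair[:, actions[j]]
            actions[i] = gains.argmax()
        if np.array_equal(actions, old_actions) and np.array_equal(edges, old_edges):
            break  # neither step changed anything: a fixed point
    return actions, edges
```

Calling `alternating_search(unary, pair, inter)` returns the action labels and a boolean interaction matrix. Because each step only improves the assumed score locally, different initializations can lead to different answers.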
