Human Interaction Recognition Framework based on Interacting Body Part Attention

Human activity recognition in videos has been widely studied and has recently advanced significantly with deep learning approaches; however, it remains a challenging task. In this paper, we propose a novel framework that simultaneously considers both implicit and explicit representations of human interactions by fusing information from the local image region where the interaction actively occurs, the primitive motion together with the posture of each subject's body parts, and the co-occurrence of overall appearance change. Human interactions differ depending on how the body parts of each person interact with those of the other. The proposed method captures the subtle differences between interactions using interacting body part attention: semantically important body parts that interact with other objects are given more weight during feature representation. The combined feature, consisting of the interacting body part attention-based individual representation and the co-occurrence descriptor of the full-body appearance change, is fed into a long short-term memory network to model the temporal dynamics in a single framework. We validate the effectiveness of the proposed method on four widely used public datasets, where it outperforms competing state-of-the-art methods.
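The idea of giving interacting body parts more weight can be illustrated with a minimal sketch. The abstract does not specify how the attention weights are computed, so everything here is an assumption made for illustration: the function names (`part_attention`, `attend`, `frame_descriptor`), the distance-based scoring rule (a part scores higher the closer it is to any joint of the other person), the `temperature` parameter, and the simple concatenation with the co-occurrence descriptor. In the described framework, one such descriptor per frame would then be passed to a long short-term memory network, which is omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def part_attention(joints_a, joints_b, temperature=1.0):
    """Attention weights for subject A's body parts.

    Hypothetical rule (not taken from the paper): a part of A gets a
    higher score the closer it lies to the nearest joint of subject B.
    """
    scores = []
    for xa, ya in joints_a:
        nearest = min(math.hypot(xa - xb, ya - yb) for xb, yb in joints_b)
        scores.append(-nearest / temperature)
    return softmax(scores)


def attend(part_features, weights):
    """Attention-weighted sum of per-part feature vectors."""
    dim = len(part_features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, part_features))
        for d in range(dim)
    ]


def frame_descriptor(part_features, joints_a, joints_b, cooccurrence):
    """Concatenate the attended part feature with a co-occurrence descriptor.

    Plain concatenation is an assumed fusion step for this sketch.
    """
    weights = part_attention(joints_a, joints_b)
    return attend(part_features, weights) + list(cooccurrence)


# Toy example: subject A has two parts, subject B has one joint.
joints_a = [(0.0, 0.0), (1.0, 0.0)]
joints_b = [(2.0, 0.0)]
part_features = [[1.0, 0.0], [0.0, 1.0]]
weights = part_attention(joints_a, joints_b)
descriptor = frame_descriptor(part_features, joints_a, joints_b, [0.5])
```

In this toy example the second part of subject A lies closer to subject B, so it receives the larger weight and dominates the attended feature.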
