Domain learning joint with semantic adaptation for human action recognition

Abstract: Action recognition is a challenging task in computer vision, and the shortage of training samples is a bottleneck in current research. With the explosive growth of Internet data, some researchers use prior knowledge learned from various video sources to help recognize action videos in the target domain, an approach called knowledge adaptation. Based on this idea, we propose a novel framework for action recognition, called Semantic Adaptation based on the Vector of Locally Max-Pooled deep learned Features (SA-VLMPF). The framework consists of three parts: a Two-Stream Fusion Network (TSFN), the Vector of Locally Max-Pooled deep learned Features (VLMPF), and a Semantic Adaptation Model (SAM). TSFN adopts a cascaded convolution fusion strategy to combine the convolutional features extracted from a two-stream network. VLMPF retains the long-term information in videos and removes irrelevant information by capturing multiple local features and extracting those with the highest response to the action category. SAM first maps the data of the auxiliary domain and the target domain into high-level semantic representations through a deep network; the representations obtained from the auxiliary domain are then adapted to the target domain to optimize the target classifier. Compared with existing methods, the proposed method exploits the strength of deep learning in obtaining high-level semantic information to improve the performance of knowledge adaptation. At the same time, SA-VLMPF makes full use of the auxiliary data to compensate for the shortage of training samples. Experiments on several pairs of datasets validate the effectiveness of the proposed framework, and the results show that SA-VLMPF outperforms state-of-the-art knowledge adaptation methods.
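
The abstract does not give the exact form of VLMPF, so the following is only a minimal sketch of one plausible reading of "locally max-pooled" features. It assumes that frame-level deep features are split into contiguous temporal segments, that each segment is max-pooled per dimension to keep its strongest responses, and that the pooled vectors are concatenated into one video-level descriptor. The function name `local_max_pool`, the `num_segments` parameter, and the segmentation scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def local_max_pool(frame_features: np.ndarray, num_segments: int) -> np.ndarray:
    """Illustrative local max pooling over time (an assumption, not the paper's exact VLMPF).

    frame_features: array of shape (T, D), one D-dimensional feature per frame.
    num_segments:   number of contiguous temporal segments to pool over.
    Returns a vector of length num_segments * D.
    """
    if frame_features.ndim != 2:
        raise ValueError("frame_features must have shape (T, D)")
    num_frames = frame_features.shape[0]
    if not 1 <= num_segments <= num_frames:
        raise ValueError("num_segments must be between 1 and the number of frames")

    # Split the frames into contiguous, nearly equal temporal segments.
    segments = np.array_split(frame_features, num_segments, axis=0)

    # Keep only the strongest response per dimension inside each segment.
    pooled = [segment.max(axis=0) for segment in segments]

    # Concatenating the segment vectors preserves coarse temporal order.
    return np.concatenate(pooled)


if __name__ == "__main__":
    # Six frames with two feature dimensions each (toy values).
    features = np.arange(12, dtype=float).reshape(6, 2)
    descriptor = local_max_pool(features, num_segments=3)
    print(descriptor.shape)
```

Pooling per segment rather than over the whole video is what lets such a descriptor keep some long-range temporal structure while discarding weak, likely irrelevant activations within each segment.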
