A Multimodal Pairwise Discrimination Network for Cross-Domain Action Recognition

In recent years, action recognition has become a hot research topic in computer vision and machine learning. Although many well-designed action recognition approaches have been proposed, several limitations remain, including the separate fusion of different spatio-temporal features, the separately constructed classification model, and the requirement that training and testing data be captured under similar environmental conditions. As a result, research interest has shifted from traditional action recognition towards cross-domain action recognition. To address these limitations, we propose a novel multimodal pairwise discrimination network (MPD for short) for cross-domain action recognition, designed as an end-to-end architecture. MPD jointly fuses different spatio-temporal features extracted from the video, learns domain-invariant features for the source and target action domains, and builds the classification model. To characterize the shift between these domains, the parameters of corresponding layers in the MPD subnetworks are constrained to be related, but not identical. In addition, the discrimination of the domain-invariant features is further improved. Extensive experiments on two public benchmarks covering indoor and outdoor environments demonstrate that MPD significantly outperforms state-of-the-art methods, with a 4% to 20% improvement in average accuracy.
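
To make the pairwise idea in the abstract concrete, the following minimal PyTorch sketch shows one plausible reading of it: a source subnetwork and a target subnetwork each fuse spatial and temporal features, their corresponding layer parameters are encouraged to stay related (but not identical) through an L2 penalty, and a shared classifier is trained on the fused representation. All module names, feature dimensions, and the loss weighting here are illustrative assumptions, not the authors' exact MPD implementation.

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """One domain branch: fuses spatial and temporal features into an embedding."""
    def __init__(self, spatial_dim=2048, temporal_dim=2048, embed_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(spatial_dim + temporal_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim), nn.ReLU(),
        )

    def forward(self, spatial_feat, temporal_feat):
        return self.fuse(torch.cat([spatial_feat, temporal_feat], dim=1))

class PairwiseNet(nn.Module):
    """Source/target branches with related (not tied) weights and a shared classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.source = SubNet()
        self.target = SubNet()
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, s_sp, s_tm, t_sp, t_tm):
        fs, ft = self.source(s_sp, s_tm), self.target(t_sp, t_tm)
        return self.classifier(fs), self.classifier(ft), fs, ft

    def relatedness_penalty(self):
        # Penalize the distance between corresponding layer weights, so the two
        # branches stay related but are still free to model the domain shift.
        penalty = 0.0
        for ps, pt in zip(self.source.parameters(), self.target.parameters()):
            penalty = penalty + (ps - pt).pow(2).sum()
        return penalty

# Usage sketch: supervised loss on labelled source videos plus the
# weight-relatedness term (dummy features and labels for illustration).
model = PairwiseNet(num_classes=10)
criterion = nn.CrossEntropyLoss()
s_sp, s_tm = torch.randn(4, 2048), torch.randn(4, 2048)
t_sp, t_tm = torch.randn(4, 2048), torch.randn(4, 2048)
labels = torch.randint(0, 10, (4,))
logits_s, _, _, _ = model(s_sp, s_tm, t_sp, t_tm)
loss = criterion(logits_s, labels) + 1e-3 * model.relatedness_penalty()
loss.backward()
```

In this reading, the relatedness penalty replaces strict weight sharing: setting its weight very high would collapse the two branches into a single shared network, while setting it to zero would train two independent networks with no cross-domain coupling.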
