DTMMN: Deep transfer multi-metric network for RGB-D action recognition

Abstract In the field of action recognition, fusing multi-modality information is a major research direction. Although many state-of-the-art methods have emerged recently, they lack consideration for the problem of modality deficiency and modality distribution. This paper proposes a novel two-stage method which named deep transfer multi-metric network (DTMMN) to investigate these two questions. First, we train a transfer network to learn the mapping from RGB to depth modality for addressing modality deficiency, and then we use the multi-metric learning to get the relation and differences of feature distribution for actions recognition. We experiment on two challenge datasets and our approach outperforms the state-of-the-art methods on both datasets.

[1]  Guy Marchal,et al.  Multimodality image registration by maximization of mutual information , 1997, IEEE Transactions on Medical Imaging.

[2]  Chunheng Wang,et al.  Deep nonlinear metric learning with independent subspace analysis for face verification , 2012, ACM Multimedia.

[3]  Shiguang Shan,et al.  Fusing Robust Face Region Descriptors via Multiple Metric Learning for Face Recognition in the Wild , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Jun Wan,et al.  Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition , 2018, AAAI.

[5]  Wolfram Burgard,et al.  Multimodal deep learning for robust RGB-D object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[6]  Pichao Wang,et al.  Joint Distance Maps Based Action Recognition With Convolutional Neural Networks , 2017, IEEE Signal Processing Letters.

[7]  Jiwen Lu,et al.  Deep Coupled Metric Learning for Cross-Modal Matching , 2017, IEEE Transactions on Multimedia.

[8]  Dit-Yan Yeung,et al.  Multi-Task Learning in Heterogeneous Feature Spaces , 2011, AAAI.

[9]  Marwan Torki,et al.  Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[10]  Nanning Zheng,et al.  View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Nasser Kehtarnavaz,et al.  UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[12]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[14]  Jiwen Lu,et al.  Deep Localized Metric Learning , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[15]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[16]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[17]  Luc Van Gool,et al.  Deep Learning on Lie Groups for Skeleton-Based Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Noah A. Rosenberg,et al.  CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure , 2007, Bioinform..

[19]  Sanja Fidler,et al.  3D Graph Neural Networks for RGBD Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Pichao Wang,et al.  Skeleton Optical Spectra-Based Action Recognition Using Convolutional Neural Networks , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[21]  Matthias Nießner,et al.  3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jiwen Lu,et al.  Local Large-Margin Multi-Metric Learning for Face and Kinship Verification , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[23]  Zicheng Liu,et al.  HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Patrick Pérez,et al.  Maintaining multimodality through mixture tracking , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[25]  Kai Zhao,et al.  Translate-to-Recognize Networks for RGB-D Scene Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Jian-Huang Lai,et al.  Deep Bilinear Learning for RGB-D Action Recognition , 2018, ECCV.

[27]  Jian Yang,et al.  Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition , 2018, AAAI.

[28]  Jian-Huang Lai,et al.  Jointly Learning Heterogeneous Features for RGB-D Activity Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Ling Shao,et al.  Structure-Preserving Binary Representations for RGB-D Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Yun Fu,et al.  Discriminative Relational Representation Learning for RGB-D Action Recognition , 2016, IEEE Transactions on Image Processing.

[32]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Min Jiang,et al.  RGB-D Action Recognition Using Multimodal Correlative Representation Learning Model , 2019, IEEE Sensors Journal.

[34]  Limin Wang,et al.  Multi-view Super Vector for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Horst Bischof,et al.  Large scale metric learning from equivalence constraints , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Jiebo Luo,et al.  Heterogeneous feature machines for visual recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[37]  Pichao Wang,et al.  Scene Flow to Action Map: A New Representation for RGB-D Based Action Recognition with Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Jiwen Lu,et al.  Sharable and Individual Multi-View Metric Learning , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Jiwen Lu,et al.  Discriminative Deep Metric Learning for Face Verification in the Wild , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Gim Hee Lee,et al.  Transferable Semi-Supervised 3D Object Detection From RGB-D Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Frédéric Jurie,et al.  PCCA: A new approach for distance learning from sparse pairwise constraints , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Ling Shao,et al.  Action Recognition Using 3D Histograms of Texture and A Multi-Class Boosting Classifier , 2017, IEEE Transactions on Image Processing.

[43]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[44]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[45]  Pichao Wang,et al.  Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks , 2016, ACM Multimedia.

[46]  Vittorio Murino,et al.  Modality Distillation with Multiple Stream Networks for Action Recognition , 2018, ECCV.

[47]  Shengcai Liao,et al.  Person re-identification by Local Maximal Occurrence representation and metric learning , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Stefan Leutenegger,et al.  X-Section: Cross-Section Prediction for Enhanced RGB-D Fusion , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[49]  Gang Wang,et al.  Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Yue Zhao,et al.  PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-modalities , 2018, ECCV.