Paper Information - SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Recognizing an activity from a single reference sample using metric learning is a promising research field. The majority of few-shot methods focus on object recognition or face identification. We propose a metric learning approach that reduces the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which smaller distances encode similar actions and larger distances encode different actions. Our approach is based on a signal-level formulation and remains flexible across a variety of modalities. It outperforms the baseline on the large-scale NTU RGB+D 120 dataset for the one-shot action recognition protocol by 5.6%. With just 60% of the training data, our approach still outperforms the baseline by 3.7%. With 40% of the training data, it performs comparably to the second-best approach. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton, and fused data and on the Simitate dataset for motion capture data. Finally, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.
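The two ideas the abstract combines, a triplet loss for learning the embedding and a nearest neighbor search for one-shot classification, can be illustrated with a minimal sketch. It is not the authors' implementation. The embeddings are hand-written 2-D vectors standing in for the output of the CNN encoder, and the plain Euclidean distance, the margin of 0.2, and the names `triplet_loss` and `classify_one_shot` are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on one (anchor, positive, negative) triplet.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive. The margin value is an assumed default.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def classify_one_shot(query, references):
    """Return the label of the reference embedding closest to the query.

    `references` maps each action label to the embedding of its single
    reference sample (one shot per class).
    """
    return min(references, key=lambda label: euclidean(query, references[label]))


# Hypothetical embeddings; in practice these would come from the trained encoder.
references = {"wave": [0.9, 0.1], "clap": [0.1, 0.9]}
query = [0.8, 0.2]

print(classify_one_shot(query, references))
print(triplet_loss(query, references["wave"], references["clap"]))
```

Here the query lies much closer to the "wave" reference than to the "clap" reference, so the nearest neighbor search returns "wave", and the triplet built from the same three points already satisfies the margin.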
