SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition

Recognizing an activity from a single reference sample using metric learning is a promising research field. The majority of few-shot methods focus on object recognition or face identification. We propose a metric learning approach that reduces the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which small distances encode similar actions and large distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. On the large-scale NTU RGB+D 120 dataset, it outperforms the baseline for the one-shot action recognition protocol by 5.6%. With just 60% of the training data, our approach still outperforms the baseline by 3.7%. With 40% of the training data, it performs comparably to the second-best approach. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data, and on the Simitate dataset for motion capture data. Finally, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.
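
The two ideas named above, a triplet loss for learning the embedding and a nearest neighbor search for one-shot classification, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes embeddings are already computed, uses the common hinge form of the triplet loss with Euclidean distance, and the function names, the margin of 0.2, the action labels and the 2-D vectors are all hypothetical values chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is.
    The margin value is an assumption, not taken from the paper."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def one_shot_classify(query, references):
    """Return the label of the single reference embedding closest to the query.
    `references` maps each action label to exactly one embedding (one shot)."""
    return min(references, key=lambda label: euclidean(query, references[label]))


# Hypothetical 2-D embeddings: one reference sample per unseen action class.
references = {"wave": [0.9, 0.1], "sit down": [0.1, 0.8]}
query = [0.8, 0.2]

predicted = one_shot_classify(query, references)
loss = triplet_loss(query, references["wave"], references["sit down"])
print(predicted, loss)
```

In this toy setup the query lies much closer to the "wave" reference than to the "sit down" reference, so the nearest neighbor search selects "wave" and the triplet already satisfies the margin.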
