Paper Information - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition

Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition

Humans can easily recognize actions from only a few examples, while existing video recognition models still rely heavily on large-scale labeled data. This observation has motivated increasing interest in few-shot video action recognition, which aims to learn new actions from only a few labeled samples. In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net. Concretely, we tackle the few-shot recognition problem from three aspects. First, we alleviate the extreme data scarcity by introducing depth information as a carrier of the scene, which brings extra visual information to our model. Second, we fuse the representation of the original RGB clips with multiple non-strictly corresponding depth clips sampled by our temporal asynchronization augmentation mechanism, which synthesizes new instances at the feature level. Third, we propose a novel Depth Guided Adaptive Instance Normalization (DGAdaIN) fusion module to fuse the two modalities efficiently. Additionally, to better mimic the few-shot recognition process, our model is trained in a meta-learning manner. Extensive experiments on several action recognition benchmarks demonstrate the effectiveness of our model.
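To make the second and third points more concrete, the following is a minimal sketch, not the authors' implementation. It assumes toy per-frame scalar features and uses the generic adaptive instance normalization formula from style transfer, in which the RGB features are normalized and then re-scaled with statistics taken from the depth features. The actual DGAdaIN module may differ, for example by learning the scale and shift from the depth features. The names `sample_clip`, `adain`, and `fuse_with_async_depth` are hypothetical.

```python
import random
from statistics import fmean, pstdev


def sample_clip(num_frames, clip_len, rng):
    """Return the frame indices of a contiguous clip with a random start."""
    start = rng.randrange(num_frames - clip_len + 1)
    return list(range(start, start + clip_len))


def adain(content, style, eps=1e-5):
    """Generic AdaIN on one channel (assumed stand-in for DGAdaIN).

    Normalizes `content` to zero mean and unit variance, then applies the
    mean and standard deviation of `style` as shift and scale.
    """
    mu_c, sd_c = fmean(content), pstdev(content)
    mu_s, sd_s = fmean(style), pstdev(style)
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]


def fuse_with_async_depth(rgb_feats, depth_feats, clip_len, num_depth_clips, rng):
    """Fuse one RGB clip with several independently sampled depth clips.

    The depth clips come from the same video but need not be temporally
    aligned with the RGB clip, so each pairing yields a new fused instance.
    """
    n = len(rgb_feats)
    rgb_clip = [rgb_feats[i] for i in sample_clip(n, clip_len, rng)]
    fused = []
    for _ in range(num_depth_clips):
        depth_idx = sample_clip(n, clip_len, rng)  # may differ from the RGB clip
        depth_clip = [depth_feats[i] for i in depth_idx]
        fused.append(adain(rgb_clip, depth_clip))
    return fused


if __name__ == "__main__":
    rng = random.Random(0)
    rgb = [float(i) for i in range(16)]      # toy RGB features, one per frame
    depth = [10.0 * i for i in range(16)]    # toy depth features, one per frame
    for clip in fuse_with_async_depth(rgb, depth, clip_len=4, num_depth_clips=3, rng=rng):
        print([round(v, 2) for v in clip])
```

In this sketch a single RGB clip produces `num_depth_clips` fused feature sequences, which illustrates how pairing with non-aligned depth clips can act as feature-level augmentation when labeled videos are scarce.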
