Compound Prototype Matching for Few-shot Action Recognition

Few-shot action recognition aims to recognize novel action classes from only a small number of labeled training samples. In this work, we propose a novel approach that first summarizes each video into compound prototypes, consisting of a group of global prototypes and a group of focused prototypes, and then measures video similarity based on these prototypes. Each global prototype is encouraged to summarize a specific aspect of the entire video, for example, the start or the evolution of the action. Because no explicit annotation is available to supervise the global prototypes, we additionally use a group of focused prototypes, each of which attends to certain timestamps in the video. We measure video similarity by matching the compound prototypes of the support and query videos. The global prototypes are matched directly, so that videos are compared from the same perspective, for example, whether two actions start similarly. For the focused prototypes, since the timing of an action varies from video to video, we apply bipartite matching, which allows actions at different temporal positions and with different temporal shifts to be compared. Experiments demonstrate that our proposed method achieves state-of-the-art results on multiple benchmarks.
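The two matching strategies described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the equal weighting of the two scores, and the brute-force search over permutations (a stand-in for an assignment solver such as the Hungarian method, practical only for a handful of prototypes) are all assumptions made for illustration.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_similarity(support, query):
    """Direct matching: the i-th support prototype is compared only
    with the i-th query prototype (same 'perspective')."""
    return sum(cosine(s, q) for s, q in zip(support, query)) / len(support)


def focused_similarity(support, query):
    """Bipartite matching: pick the one-to-one assignment between support
    and query prototypes with the highest total similarity.
    Brute force over permutations, for illustration only."""
    n = len(support)
    best_total = max(
        sum(cosine(support[i], query[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n


def compound_similarity(sup_global, sup_focused, qry_global, qry_focused):
    """Hypothetical combination: equal-weight average of both scores."""
    return 0.5 * (
        global_similarity(sup_global, qry_global)
        + focused_similarity(sup_focused, qry_focused)
    )
```

In this sketch, two sets of focused prototypes that contain the same vectors in a different order still receive the maximum score, because the assignment is free to reorder them. The same two sets compared by direct matching would score lower, which is the behavior wanted for the global prototypes, where position carries meaning.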
