Temporal-Relational CrossTransformers for Few-Shot Action Recognition

We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on the few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and of learning higher-order relational CrossTransformers.

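To make the prototype construction concrete, the following is a minimal sketch, assuming PyTorch, of how a single cross-attention head over ordered frame tuples could build query-specific class prototypes from a class's support videos. The name TRXHead, the restriction to frame pairs (cardinality 2), the shared query/key projection, and the squared-distance logit are illustrative assumptions for this sketch, not the paper's exact formulation, which combines heads over several tuple cardinalities.

import torch
from itertools import combinations


class TRXHead(torch.nn.Module):
    """One cross-attention head over ordered frame tuples of a fixed size (a sketch)."""

    def __init__(self, dim, cardinality=2, d_attn=128):
        super().__init__()
        self.cardinality = cardinality
        self.gamma = torch.nn.Linear(cardinality * dim, d_attn)  # query/key map
        self.lam = torch.nn.Linear(cardinality * dim, d_attn)    # value map
        self.scale = d_attn ** 0.5

    def tuples(self, feats):
        # feats: (T, dim) per-frame backbone features; returns one row per
        # ordered frame tuple, i.e. a tensor of shape (num_tuples, cardinality * dim).
        idx = combinations(range(feats.shape[0]), self.cardinality)
        return torch.stack([feats[list(i)].reshape(-1) for i in idx])

    def forward(self, query_feats, support_feats_per_class):
        # Ordered tuples for the query and for every support video of one class.
        q = self.tuples(query_feats)                                      # (Nq, c*dim)
        s = torch.cat([self.tuples(f) for f in support_feats_per_class])  # (Ns, c*dim)

        # Each query tuple attends over all support tuples of the class ...
        attn = torch.softmax(self.gamma(q) @ self.gamma(s).T / self.scale, dim=-1)
        # ... yielding a query-specific class prototype per query tuple.
        prototype = attn @ self.lam(s)                                    # (Nq, d_attn)

        # Negative mean distance between query values and their prototypes acts
        # as the logit for this class; higher means a closer match.
        return -((self.lam(q) - prototype) ** 2).mean()


# Usage on random features: in a 5-way episode, the query would be scored
# against each class's support videos with the same head, followed by a
# softmax over the resulting class logits.
head = TRXHead(dim=2048, cardinality=2)
query = torch.randn(8, 2048)                        # 8 sampled frames of the query video
support = [torch.randn(8, 2048) for _ in range(5)]  # 5-shot support set for one class
logit = head(query, support)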