Paper Information - Few-Shot Video Classification via Temporal Alignment

Few-Shot Video Classification via Temporal Alignment

The difficulty of collecting and annotating large-scale video data has driven growing interest in models that can recognize novel classes from only a few training examples. In this paper, we propose the Ordered Temporal Alignment Module (OTAM), a novel few-shot learning framework that learns to classify previously unseen videos. While most previous work neglects long-term temporal ordering, our model explicitly leverages the temporal ordering in video data through ordered temporal alignment, which leads to strong data efficiency in few-shot learning. Concretely, our pipeline learns a deep distance measure between the query video and novel class proxies, computed over their alignment path. We adopt an episode-based training scheme and directly optimize the few-shot learning objective. We evaluate OTAM on two challenging real-world datasets, Kinetics and Something-Something-V2, and show that it significantly improves few-shot video classification over a wide range of competitive baselines and outperforms state-of-the-art methods by a large margin.
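The idea of scoring a query video against a class proxy along an order-preserving alignment path can be illustrated with a small sketch. This is not the paper's implementation: it assumes fixed per-frame feature vectors in place of a learned deep embedding, cosine distance between frames, and the classic dynamic time warping recurrence as the alignment. The function names (`frame_distance_matrix`, `alignment_cost`, `classify`) and the toy data are hypothetical.

```python
import numpy as np


def frame_distance_matrix(query, support):
    """Pairwise cosine distance: D[i, j] = 1 - cos(query[i], support[j])."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    return 1.0 - q @ s.T


def alignment_cost(dist):
    """Minimum cumulative cost of a monotonic path through dist.

    Classic dynamic time warping recurrence (an assumption, see above):
    each step advances in the query, in the support video, or in both,
    so the path can never run backwards in time.
    """
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    return acc[n, m]


def classify(query, proxies):
    """Return the label whose proxy has the lowest alignment cost to the query."""
    costs = {
        label: alignment_cost(frame_distance_matrix(query, proxy))
        for label, proxy in proxies.items()
    }
    return min(costs, key=costs.get)


# Toy data: two "classes" containing the same four frames in opposite order.
forward = np.eye(4)
backward = forward[::-1]
proxies = {"forward": forward, "backward": backward}

# A query that follows the forward order, slightly perturbed.
query = forward + 0.1

predicted = classify(query, proxies)

# Averaging frames over time discards order, so the two proxies look identical.
same_when_pooled = np.allclose(forward.mean(axis=0), backward.mean(axis=0))
```

In this toy setup the two proxies are indistinguishable once their frames are averaged over time, while the alignment cost separates them. That is the kind of ordering information the abstract says most previous work neglects.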
