Few-Shot Video Classification via Temporal Alignment

Difficulty in collecting and annotating large-scale video data raises a growing interest in learning models which can recognize novel classes with only a few training examples. In this paper, we propose the Ordered Temporal Alignment Module (OTAM), a novel few-shot learning framework that can learn to classify a previously unseen video. While most previous work neglects long-term temporal ordering information, our proposed model explicitly leverages the temporal ordering information in video data through ordered temporal alignment. This leads to strong data-efficiency for few-shot learning. In concrete, our proposed pipeline learns a deep distance measurement of the query video with respect to novel class proxies over its alignment path. We adopt an episode-based training scheme and directly optimize the few-shot learning objective. We evaluate OTAM on two challenging real-world datasets, Kinetics and Something-Something-V2, and show that our model leads to significant improvement of few-shot video classification over a wide range of competitive baselines and outperforms state-of-the-art benchmarks by a large margin.

[1]  Meinard Müller,et al.  Dynamic Time Warping , 2008 .

[2]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[3]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[4]  Arthur Mensch,et al.  Differentiable Dynamic Programming for Structured Prediction and Attention , 2018, ICML.

[5]  Ioannis Patras,et al.  TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition , 2019, BMVC.

[6]  Shuohang Wang,et al.  A Compare-Aggregate Model for Matching Text Sequences , 2016, ICLR.

[7]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[8]  Oriol Vinyals,et al.  Matching Networks for One Shot Learning , 2016, NIPS.

[9]  Shuaib Ahmed,et al.  ProtoGAN: Towards Few Shot Learning for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[10]  Tal Hassner,et al.  One Shot Similarity Metric Learning for Action Recognition , 2011, SIMBAD.

[11]  Matthew A. Brown,et al.  Low-Shot Learning with Imprinted Weights , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Abhinav Gupta,et al.  Videos as Space-Time Region Graphs , 2018, ECCV.

[13]  Martial Hebert,et al.  Low-Shot Learning from Imaginary Data , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Yu-Chiang Frank Wang,et al.  A Closer Look at Few-shot Classification , 2019, ICLR.

[16]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[17]  Tao Xiang,et al.  Learning to Compare: Relation Network for Few-Shot Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Joan Bruna,et al.  Few-Shot Learning with Graph Neural Networks , 2017, ICLR.

[19]  Razvan Pascanu,et al.  Meta-Learning with Latent Embedding Optimization , 2018, ICLR.

[20]  Bharath Hariharan,et al.  Low-Shot Visual Recognition by Shrinking and Hallucinating Features , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[23]  Yi Yang,et al.  Compound Memory Networks for Few-Shot Video Classification , 2018, ECCV.

[24]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Thomas Brox,et al.  ECO: Efficient Convolutional Network for Online Video Understanding , 2018, ECCV.

[27]  Meinard Müller,et al.  Information retrieval for music and motion , 2007 .

[28]  Henryk Sienkiewicz,et al.  Quo Vadis? , 1967, American Association of Industrial Nurses journal.

[29]  Gregory R. Koch,et al.  Siamese Neural Networks for One-Shot Image Recognition , 2015 .

[30]  Тараса Шевченка,et al.  Quo vadis? , 2013, Clinical chemistry.

[31]  Markus H. Gross,et al.  A Neural Multi-sequence Alignment TeCHnique (NeuMATCH) , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Piyush Rai,et al.  A Generative Approach to Zero-Shot and Few-Shot Action Recognition , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[33]  Song Han,et al.  Temporal Shift Module for Efficient Video Understanding , 2018, ArXiv.

[34]  Brian Hutchinson,et al.  Metric-Based Few-Shot Learning for Video Action Recognition , 2019, ArXiv.

[35]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[36]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[37]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[39]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[40]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[41]  Juan Carlos Niebles,et al.  Learning Temporal Action Proposals With Fewer Labels , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Yu-Gang Jiang,et al.  Embodied One-Shot Video Recognition: Learning from Actions of a Virtual Embodied Agent , 2019, ACM Multimedia.

[43]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[44]  Juan Carlos Niebles,et al.  D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Hugo Larochelle,et al.  Optimization as a Model for Few-Shot Learning , 2016, ICLR.

[46]  J. Schulman,et al.  Reptile: a Scalable Metalearning Algorithm , 2018 .

[47]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[49]  Juergen Gall,et al.  NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Hong Yu,et al.  Meta Networks , 2017, ICML.

[51]  Nikos Komodakis,et al.  Dynamic Few-Shot Visual Learning Without Forgetting , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[53]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  Aurko Roy,et al.  Learning to Remember Rare Events , 2017, ICLR.

[55]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[56]  Chuang Gan,et al.  TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[58]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[60]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.