Domain-Specific Priors and Meta Learning for Low-shot First-Person Action Recognition

The lack of large-scale real datasets with annotationsmakes transfer learning a necessity for video activity under-standing. Within this scope, we aim at developing an effec-tive method for low-shot transfer learning for first-personaction classification. We leverage independently trained lo-cal visual cues to learn representations that can be trans-ferred from a source domain providing primitive action la-bels to a target domain with only a handful of examples.Such visual cues include object-object interactions, handgrasps and motion within regions that are a function of handlocations. We suggest a framework based on meta-learningto appropriately extract the distinctive and domain invari-ant components of the deployed visual cues, so to be able totransfer action classification models across public datasetscaptured with different scene configurations. We thoroughlyevaluate our methodology and report promising results overstate-of-the-art action classification approaches for bothinter-class and inter-dataset transfer.

[1]  Ali Farhadi,et al.  Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.

[2]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[5]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[6]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[7]  Christian Wolf,et al.  Object Level Visual Reasoning in Videos , 2018, ECCV.

[8]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[9]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Fatih Murat Porikli,et al.  One-Shot Action Localization by Learning Sequence Matching Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[12]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[14]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[16]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[17]  Yoichi Sato,et al.  Understanding Hand-Object Manipulation with Grasp Types and Object Attributes , 2016, Robotics: Science and Systems.

[18]  Takahiro Okabe,et al.  Fast unsupervised ego-action learning for first-person sports videos , 2011, CVPR 2011.

[19]  James M. Rehg,et al.  Modeling Actions through State Changes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[21]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[22]  Ilya Sutskever,et al.  Learning Recurrent Neural Networks with Hessian-Free Optimization , 2011, ICML.

[23]  Gregory D. Hager,et al.  Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Cees Snoek,et al.  VideoLSTM convolves, attends and flows for action recognition , 2016, Comput. Vis. Image Underst..

[25]  Carlos Medrano,et al.  Fast Simplex-HMM for One-Shot Learning Activity Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[26]  James M. Rehg,et al.  Learning to recognize objects in egocentric activities , 2011, CVPR 2011.

[27]  Oriol Vinyals,et al.  Matching Networks for One Shot Learning , 2016, NIPS.

[28]  Marcin Andrychowicz,et al.  Learning to learn by gradient descent by gradient descent , 2016, NIPS.

[29]  Larry H. Matthies,et al.  First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Kris M. Kitani,et al.  Going Deeper into First-Person Activity Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[32]  James M. Rehg,et al.  Learning to Predict Gaze in Egocentric Video , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Ming Shao,et al.  A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Amos J. Storkey,et al.  How to train your MAML , 2018, ICLR.

[36]  Sergey Levine,et al.  Probabilistic Model-Agnostic Meta-Learning , 2018, NeurIPS.

[37]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[38]  Cees Snoek,et al.  Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[41]  Deva Ramanan,et al.  Understanding Everyday Hands in Action from RGB-D Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[44]  James M. Rehg,et al.  Delving into egocentric actions , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[46]  Dima Damen,et al.  Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[47]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[48]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[50]  Dima Damen,et al.  Towards an Unequivocal Representation of Actions , 2018, CVPR Workshops.

[51]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Cordelia Schmid,et al.  Actor and Observer: Joint Modeling of First and Third-Person Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Cordelia Schmid,et al.  Actor and Observer: Joint Modeling of First and Third-Person Videos , 2018, CVPR.

[54]  Ivan Laptev,et al.  Learning from Video and Text via Large-Scale Discriminative Clustering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[59]  Diane J. Cook,et al.  Transfer learning for activity recognition: a survey , 2013, Knowledge and Information Systems.

[60]  Stefan Lee,et al.  Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[61]  Hugo Larochelle,et al.  Optimization as a Model for Few-Shot Learning , 2016, ICLR.

[62]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[63]  Martial Hebert,et al.  Trajectons: Action recognition through the motion analysis of tracked features , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[64]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[65]  G. Evans,et al.  Learning to Optimize , 2008 .

[66]  James M. Rehg,et al.  Learning to Recognize Daily Actions Using Gaze , 2012, ECCV.

[67]  Andrew Zisserman,et al.  What have We Learned from Deep Representations for Action Recognition? , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[68]  James W. Davis,et al.  The Representation and Recognition of Action Using Temporal Templates , 1997, CVPR 1997.

[69]  Bowen Zhou,et al.  A Structured Self-attentive Sentence Embedding , 2017, ICLR.

[70]  Alec Radford,et al.  Improving Language Understanding by Generative Pre-Training , 2018 .