Embodied One-Shot Video Recognition: Learning from Actions of a Virtual Embodied Agent

One-shot learning aims to recognize novel target classes from only a few examples by transferring knowledge from source classes, under the general assumption that the source and target classes are semantically related but not identical. Building on this assumption, recent work has concentrated on image-based one-shot learning, while video-based one-shot learning remains largely unexplored. One challenge is that the disjoint-class assumption is hard to maintain for videos, since clips of target classes may inadvertently appear within videos of source classes. To address this issue, we introduce a novel setting, termed embodied-agent-based one-shot learning, which leverages synthetic videos produced in a virtual environment to understand realistic videos of target classes. Within this setting, we further propose two learning tasks: embodied one-shot video domain adaptation and embodied one-shot video transfer recognition, which together serve as a testbed for evaluating video-related one-shot learning methods. In addition, we propose a general video segment augmentation method that substantially facilitates a variety of one-shot learning tasks. Experimental results validate the soundness of our setting and learning tasks, and demonstrate the effectiveness of our augmentation approach for video recognition in the small-sample regime.
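The abstract does not spell out how the video segment augmentation works, but a minimal sketch of one plausible reading is given below: sampling temporal segments from a single clip to multiply the few available training examples for its class. The function `segment_augment` and all of its parameters are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def segment_augment(video, segment_len, num_segments, rng=None):
    """Create extra training clips by sampling temporal segments.

    Hypothetical illustration: `video` is an array of frames with
    shape (T, H, W, C); each sampled segment is treated as an
    additional example of the same action class.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames = video.shape[0]
    assert num_frames >= segment_len, "video shorter than requested segment"
    # Random start offsets; segments may overlap, which is intentional
    # since the goal is simply to enlarge the small training set.
    starts = rng.integers(0, num_frames - segment_len + 1, size=num_segments)
    return [video[s:s + segment_len] for s in starts]

# Example: turn one 64-frame clip into four 16-frame training segments.
clip = np.zeros((64, 112, 112, 3), dtype=np.float32)
augmented = segment_augment(clip, segment_len=16, num_segments=4)
print([seg.shape for seg in augmented])  # four (16, 112, 112, 3) arrays
```

Under this reading, each sampled segment inherits the label of its source clip, so a one-shot class with a single video yields several training instances at negligible cost.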
