Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Incorporating these signals as knowledge distillation losses during pre-training yields models that benefit both from the scale and diversity of third-person video data and from representations that capture salient egocentric properties. Our experiments show that our "Ego-Exo" framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.
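To make the pre-training recipe concrete, below is a minimal sketch (not the authors' code) of how auxiliary distillation losses could be attached to a standard video backbone during third-person pre-training. The cue head, its target source (e.g., a frozen hand-object interaction detector producing soft pseudo-labels), the loss weight, and the temperature are illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch: auxiliary "ego-cue" distillation alongside the standard
# third-person classification loss. Head names, cue targets, and weights
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EgoExoPretrainHeads(nn.Module):
    def __init__(self, feat_dim: int, num_actions: int, num_cue_classes: int):
        super().__init__()
        self.action_head = nn.Linear(feat_dim, num_actions)   # supervised action head
        self.cue_head = nn.Linear(feat_dim, num_cue_classes)  # auxiliary ego-cue head

    def forward(self, clip_feats: torch.Tensor):
        # clip_feats: (B, feat_dim) pooled features from any video backbone
        return self.action_head(clip_feats), self.cue_head(clip_feats)


def pretrain_loss(action_logits, cue_logits, action_labels, cue_soft_targets,
                  distill_weight: float = 0.5, temperature: float = 2.0):
    """Cross-entropy on third-person action labels plus a soft KL distillation
    term that pushes the model toward egocentric-cue pseudo-labels computed
    offline by a frozen teacher (hypothetical choice of teacher and weight)."""
    ce = F.cross_entropy(action_logits, action_labels)
    log_p = F.log_softmax(cue_logits / temperature, dim=1)
    q = F.softmax(cue_soft_targets / temperature, dim=1)
    kd = F.kl_div(log_p, q, reduction="batchmean") * (temperature ** 2)
    return ce + distill_weight * kd
```

After pre-training with the combined loss, the auxiliary head would be discarded and the backbone fine-tuned on the downstream egocentric task in the usual way.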
