With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

In egocentric videos, actions occur in quick succession. We capitalise on the action’s temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action-sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets and report state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating audio as an input modality and a language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
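
To make the approach concrete, below is a minimal sketch of the two ideas in the abstract: a transformer encoder that attends over a window of neighbouring actions, each represented by pre-extracted video and audio features, to classify the centre action, followed by language-model rescoring of the resulting predictions. This is an illustration only, not the authors' MTCN implementation; the feature dimension, window size, layer counts, and the rescoring weight `alpha` are all assumptions.

```python
import torch
import torch.nn as nn

class TemporalContextClassifier(nn.Module):
    """Sketch of a temporal-context transformer (hypothetical, not MTCN).

    A window of W consecutive actions is fed as 2*W tokens (one video
    token and one audio token per action); the encoder lets the centre
    action attend to its neighbours before classification.
    """

    def __init__(self, feat_dim=1024, num_classes=97,  # e.g. the verb classes of EPIC-KITCHENS-100
                 window=5, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable embeddings distinguishing the two modalities.
        self.modality_emb = nn.Parameter(torch.randn(2, feat_dim) * 0.02)
        # Learnable position embeddings over the 2*W tokens of the window.
        self.pos_emb = nn.Parameter(torch.randn(2 * window, feat_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)
        self.window = window

    def forward(self, video_feats, audio_feats):
        # video_feats, audio_feats: (B, W, feat_dim) per-action features.
        v = video_feats + self.modality_emb[0]
        a = audio_feats + self.modality_emb[1]
        tokens = torch.cat([v, a], dim=1) + self.pos_emb   # (B, 2W, D)
        enc = self.encoder(tokens)
        # Classify the centre action from its contextualised video token.
        return self.head(enc[:, self.window // 2])

def rescore_with_lm(log_probs, lm_log_probs, alpha=0.5):
    """Rescore classifier predictions with a language-model prior over
    the surrounding action sequence (alpha is an assumed weight)."""
    return log_probs + alpha * lm_log_probs

# Usage: classify the middle of a 5-action window for a batch of 2 clips.
model = TemporalContextClassifier()
logits = model(torch.randn(2, 5, 1024), torch.randn(2, 5, 1024))
print(logits.shape)  # torch.Size([2, 97])
```

The abstract only states that an explicit language model supplies action-sequence context to rescore predictions; the log-linear combination in `rescore_with_lm` is one common way to realise such rescoring, not necessarily the paper's exact formulation.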
