With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

In egocentric videos, actions occur in quick succession. We capitalise on the action’s temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action-sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets and report state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context, as well as of incorporating audio as an input modality and a language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
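
To make the approach concrete, below is a minimal sketch of the two ideas in the abstract: a transformer encoder that attends over a window of neighbouring actions, each represented by pre-extracted video and audio features, to classify the centre action, followed by language-model rescoring of the resulting predictions. This is an illustration only, not the authors' MTCN implementation; the feature dimension, window size, layer counts, and the rescoring weight `alpha` are all assumptions.

```python
import torch
import torch.nn as nn

class TemporalContextClassifier(nn.Module):
    """Sketch of a temporal-context transformer (hypothetical, not MTCN).

    A window of W consecutive actions is fed as 2*W tokens (one video
    token and one audio token per action); the encoder lets the centre
    action attend to its neighbours before classification.
    """

    def __init__(self, feat_dim=1024, num_classes=97,  # e.g. the verb classes of EPIC-KITCHENS-100
                 window=5, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable embeddings distinguishing the two modalities.
        self.modality_emb = nn.Parameter(torch.randn(2, feat_dim) * 0.02)
        # Learnable position embeddings over the 2*W tokens of the window.
        self.pos_emb = nn.Parameter(torch.randn(2 * window, feat_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)
        self.window = window

    def forward(self, video_feats, audio_feats):
        # video_feats, audio_feats: (B, W, feat_dim) per-action features.
        v = video_feats + self.modality_emb[0]
        a = audio_feats + self.modality_emb[1]
        tokens = torch.cat([v, a], dim=1) + self.pos_emb   # (B, 2W, D)
        enc = self.encoder(tokens)
        # Classify the centre action from its contextualised video token.
        return self.head(enc[:, self.window // 2])

def rescore_with_lm(log_probs, lm_log_probs, alpha=0.5):
    """Rescore classifier predictions with a language-model prior over
    the surrounding action sequence (alpha is an assumed weight)."""
    return log_probs + alpha * lm_log_probs

# Usage: classify the middle of a 5-action window for a batch of 2 clips.
model = TemporalContextClassifier()
logits = model(torch.randn(2, 5, 1024), torch.randn(2, 5, 1024))
print(logits.shape)  # torch.Size([2, 97])
```

The abstract only states that an explicit language model supplies action-sequence context to rescore predictions; the log-linear combination in `rescore_with_lm` is one common way to realise such rescoring, not necessarily the paper's exact formulation.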
