Dima Damen | Evangelos Kazakos | Jaesung Huh | Arsha Nagrani | Andrew Zisserman
[1] Hongyi Zhang, et al. mixup: Beyond Empirical Risk Minimization, 2017, ICLR.
[2] Kaiming He, et al. Long-Term Feature Banks for Detailed Video Understanding, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[4] Yi Li, et al. RESOUND: Towards Action Recognition Without Representation Bias, 2018, ECCV.
[5] Kevin Wilson, et al. Looking to listen at the cocktail party, 2018, ACM Trans. Graph.
[6] Kristen Grauman, et al. Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Yale Song, et al. Parameter Efficient Multimodal Transformers for Video Representation Learning, 2020, ICLR.
[8] Alexander M. Rush, et al. Character-Aware Neural Language Models, 2015, AAAI.
[9] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.
[10] Jian Ma, et al. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100, 2021, Int. J. Comput. Vis.
[11] Oswald Lanz, et al. Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition, 2018, BMVC.
[12] Yong Jae Lee, et al. Audiovisual SlowFast Networks for Video Recognition, 2020, ArXiv.
[13] Yoshua Bengio, et al. On integrating a language model into neural machine translation, 2017, Comput. Speech Lang.
[14] Efthymios Tzinis, et al. Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds, 2020, ICLR.
[15] Giovanni Maria Farinella, et al. What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.
[17] Jonathan Huang, et al. Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Dima Damen, et al. Multi-Modal Domain Adaptation for Fine-Grained Action Recognition, 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).
[19] Christian Wolf, et al. Object Level Visual Reasoning in Videos, 2018, ECCV.
[20] Dima Damen, et al. Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[21] Andrea Vedaldi, et al. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, 2021, NeurIPS.
[22] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, ArXiv.
[23] Yi Yang, et al. Entangled Transformer for Image Captioning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[24] Andrew Zisserman, et al. Temporal Query Networks for Fine-grained Video Understanding, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Sanja Fidler, et al. Learning to Generate Diverse Dance Motions with Transformer, 2020, ArXiv.
[26] Cordelia Schmid, et al. Attention Bottlenecks for Multimodal Fusion, 2021, ArXiv.
[27] Dima Damen, et al. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[28] Gedas Bertasius, et al. Is Space-Time Attention All You Need for Video Understanding?, 2021, ICML.
[29] James R. Glass, et al. Unsupervised Learning of Spoken Language with Visual Context, 2016, NIPS.
[30] Dima Damen, et al. Slow-Fast Auditory Streams for Audio Recognition, 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[31] David A. Ross, et al. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[32] Kyle Min, et al. Integrating Human Gaze into Attention for Egocentric Activity Recognition, 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[33] Ronald Poppe, et al. Multitask Learning to Improve Egocentric Action Recognition, 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).
[34] Andrew Zisserman, et al. Video Action Transformer Network, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Cordelia Schmid, et al. ViViT: A Video Vision Transformer, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[36] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[37] Yoichi Sato, et al. Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions, 2019, ArXiv.
[38] Kyomin Jung, et al. Effective Sentence Scoring Method Using BERT for Speech Recognition, 2019, ACML.
[39] Yang Wang, et al. Cross-Modal Self-Attention Network for Referring Image Segmentation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Yann LeCun, et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[41] Chen Sun, et al. Multi-modal Transformer for Video Retrieval, 2020, ECCV.
[42] Georgios Tzimiropoulos, et al. Space-time Mixing Attention for Video Transformer, 2021, NeurIPS.
[43] Erich Elsen, et al. Deep Speech: Scaling up end-to-end speech recognition, 2014, ArXiv.
[44] Fadime Sener, et al. Temporal Aggregate Representations for Long-Range Video Understanding, 2020, ECCV.
[45] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[46] Jitendra Malik, et al. SlowFast Networks for Video Recognition, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[47] Du Tran, et al. What Makes Training Multi-Modal Classification Networks Hard?, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] James M. Rehg, et al. In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video, 2018, ECCV.
[49] Yoshua Bengio, et al. A Neural Probabilistic Language Model, 2003, J. Mach. Learn. Res.
[50] Lukás Burget, et al. Recurrent neural network based language model, 2010, INTERSPEECH.
[51] Chuang Gan, et al. Foley Music: Learning to Generate Music from Videos, 2020, ECCV.
[52] Carlos Gómez-Rodríguez, et al. Comparing neural- and N-gram-based language models for word segmentation, 2018, J. Assoc. Inf. Sci. Technol.
[53] Esa Rahtu, et al. Multi-modal Dense Video Captioning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[54] Ze-Nian Li, et al. Learning Spatiotemporal Attention for Egocentric Action Recognition, 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).
[55] Alejandro Cartas, et al. Modeling Long-Term Interactions to Enhance Action Recognition, 2021, 2020 25th International Conference on Pattern Recognition (ICPR).
[56] Cordelia Schmid, et al. VideoBERT: A Joint Model for Video and Language Representation Learning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[57] Andrew Zisserman, et al. Perceiver: General Perception with Iterative Attention, 2021, ICML.
[58] Cordelia Schmid, et al. Learning Video Representations using Contrastive Bidirectional Transformer, 2019.
[59] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[60] Quoc V. Le, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[61] Basura Fernando, et al. Human Action Sequence Classification, 2019, ArXiv.
[62] Sergio Escalera, et al. LSTA: Long Short-Term Attention for Egocentric Action Recognition, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[63] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.
[64] Ali Farhadi, et al. Actions ~ Transformations, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[65] Juergen Gall, et al. Temporal Action Detection Using a Statistical Language Model, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[66] Dima Damen, et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset, 2018, ArXiv.
[67] Chuang Gan, et al. TSM: Temporal Shift Module for Efficient Video Understanding, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[68] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.
[69] Koichi Shinoda, et al. CTC Network with Statistical Language Modeling for Action Sequence Recognition in Videos, 2017, ACM Multimedia.
[70] Samy Bengio, et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015, NIPS.
[71] Gedas Bertasius, et al. Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[72] Ashish Vaswani, et al. Self-Attention with Relative Position Representations, 2018, NAACL.