Speech2Action: Cross-Modal Supervision for Action Recognition
暂无分享,去创建一个
Cordelia Schmid | Andrew Zisserman | Rahul Sukthankar | Chen Sun | David A. Ross | Arsha Nagrani | David Ross | R. Sukthankar | C. Schmid | Andrew Zisserman | Chen Sun | Arsha Nagrani
[1] Lorenzo Torresani,et al. DistInit: Learning Video Representations Without a Single Labeled Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[2] Yueting Zhuang,et al. Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[4] Cordelia Schmid,et al. Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[5] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[6] Abdullah Al Mamun,et al. Aligning movies with scripts by exploiting temporal ordering constraints , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).
[7] Chuang Gan,et al. The Sound of Pixels , 2018, ECCV.
[8] Andrew Zisserman,et al. Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.
[9] B. Taskar,et al. Learning from ambiguously labeled images , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[10] Steven Bird,et al. NLTK: The Natural Language Toolkit , 2002, ACL 2006.
[11] Thomas Serre,et al. HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.
[12] Lorenzo Torresani,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.
[13] Rada Mihalcea,et al. Identifying Visible Actions in Lifestyle Vlogs , 2019, ACL.
[14] Longlong Jing,et al. Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction. , 2018, 1811.11387.
[15] Cordelia Schmid,et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[16] Ming-Hsuan Yang,et al. Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[17] Heng Wang,et al. Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Andrew Zisserman,et al. Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).
[19] Ivan Laptev,et al. Learning from Video and Text via Large-Scale Discriminative Clustering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[20] Rainer Stiefelhagen,et al. “Knock! Knock! Who is it?” probabilistic person identification in TV-series , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[21] Yansong Tang,et al. COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Cordelia Schmid,et al. Actions in context , 2009, CVPR.
[23] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Steven Bird,et al. NLTK: The Natural Language Toolkit , 2002, ACL.
[25] Bolei Zhou,et al. Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[26] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[27] Andrew Zisserman,et al. “Who are you?” - Learning person specific classifiers from video , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[28] Cordelia Schmid,et al. Learning Video Representations using Contrastive Bidirectional Transformer , 2019 .
[29] Haroon Idrees,et al. The THUMOS challenge on action recognition for videos "in the wild" , 2016, Comput. Vis. Image Underst..
[30] Andrew Owens,et al. Ambient Sound Provides Supervision for Visual Learning , 2016, ECCV.
[31] Cordelia Schmid,et al. Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.
[32] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.
[33] Cordelia Schmid,et al. Finding Actors and Actions in Movies , 2013, 2013 IEEE International Conference on Computer Vision.
[34] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[35] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[36] Jean Ponce,et al. Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.
[37] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] Jitendra Malik,et al. From Lifestyle Vlogs to Everyday Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[39] Luc Van Gool,et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.
[40] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.
[41] Heng Wang,et al. SLAC: A Sparsely Labeled Dataset for Action Classification and Localization , 2017, ArXiv.
[42] Gerhard Weikum,et al. Werdy: Recognition and Disambiguation of Verbs and Verb Phrases with Syntactic and Semantic Pruning , 2014, EMNLP.
[43] Wei Liu,et al. Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[45] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[46] Erica Klarreich,et al. Hello, my name is… , 2014, CACM.
[47] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[48] Yingli Tian,et al. Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations , 2018, ArXiv.
[49] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[50] Chen Sun,et al. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.
[51] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[52] Rainer Stiefelhagen,et al. Book2Movie: Aligning video scenes with book chapters , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[53] R. Michael Young,et al. Automated Screenplay Annotation for Extracting Storytelling Knowledge , 2017, AIIDE Workshops.
[54] Sanja Fidler,et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[55] Cordelia Schmid,et al. Contrastive Bidirectional Transformer for Temporal Representation Learning , 2019, ArXiv.
[56] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[57] Martial Hebert,et al. Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.