Learning Video Representations using Contrastive Bidirectional Transformer
Cordelia Schmid | Chen Sun | Kevin Murphy | Fabien Baradel