James R. Glass | Brian Kingsbury | Michael Picheny | Shih-Fu Chang | Samuel Thomas | David Harwath | Rameswar Panda | Rogerio Feris | Andrew Rouditchenko | Angie Boggust | Kevin Duarte | Brian Chen | Hilde Kuehne
[1] Marco Cuturi,et al. Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.
[2] Andrea Vedaldi,et al. Labelling unlabelled videos from scratch with multi-modal self-supervision , 2020, NeurIPS.
[3] Laurens van der Maaten,et al. Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Julien Mairal,et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments , 2020, NeurIPS.
[5] Andrew Zisserman,et al. Self-Supervised MultiModal Versatile Networks , 2020, NeurIPS.
[6] Cordelia Schmid,et al. Weakly Supervised Action Labeling in Videos under Ordering Constraints , 2014, ECCV.
[7] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[8] Matthijs Douze,et al. Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.
[9] Bernard Ghanem,et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering , 2019, NeurIPS.
[10] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Ting Chen,et al. Intriguing Properties of Contrastive Losses , 2020, NeurIPS.
[12] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.
[13] James R. Glass,et al. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos , 2020, Interspeech.
[14] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[15] Ivan Laptev,et al. Cross-Task Weakly Supervised Learning From Instructional Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Xirong Li,et al. Dual Encoding for Video Retrieval by Text , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[17] Michael S. Ryoo,et al. Evolving Losses for Unsupervised Video Representation Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Xilin Chen,et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation , 2020, ArXiv.
[19] Luc Van Gool,et al. SCAN: Learning to Classify Images Without Labels , 2020, ECCV.
[20] Yann LeCun,et al. Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).
[21] Abhinav Gupta,et al. ClusterFit: Improving Generalization of Visual Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Antonio Torralba,et al. See, Hear, and Read: Deep Aligned Representations , 2017, ArXiv.
[23] Andrea Vedaldi,et al. Self-labelling via simultaneous clustering and representation learning , 2020, ICLR.
[24] Sergei Vassilvitskii,et al. k-means++: the advantages of careful seeding , 2007, SODA '07.
[25] Lei Le,et al. Supervised autoencoders: Improving generalization performance with unsupervised regularizers , 2018, NeurIPS.
[26] Michael Picheny,et al. Grounding Spoken Words in Unlabeled Video , 2019, CVPR Workshops.
[27] Chen Sun,et al. Multi-modal Transformer for Video Retrieval , 2020, ECCV.
[28] Rami Ben-Ari,et al. Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning , 2020, AAAI.
[29] Florian Metze,et al. Support-set bottlenecks for video-text representation learning , 2020, ICLR.
[30] James R. Glass,et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.
[31] Junnan Li,et al. Prototypical Contrastive Learning of Unsupervised Representations , 2020, ICLR.
[32] Yi Yang,et al. ActBERT: Learning Global-Local Video-Text Representations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Juergen Gall,et al. Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[35] Lukasz Kaiser,et al. One Model To Learn Them All , 2017, ArXiv.
[36] Andrew Zisserman,et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[38] Ross B. Girshick,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Hilde Kuehne,et al. Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data , 2019, ArXiv.
[40] Yang Liu,et al. Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.
[41] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Harold W. Kuhn,et al. The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.
[43] Joydeep Ghosh,et al. Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res.
[44] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Zhe Gan,et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[47] Pietro Perona,et al. Rethinking Zero-Shot Video Classification: End-to-End Training for Realistic Applications , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Gabriel Ilharco,et al. Large-Scale Representation Learning from Visually Grounded Untranscribed Speech , 2019, CoNLL.
[49] Yutaka Satoh,et al. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[50] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[51] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.
[52] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.
[53] M. Cugmas,et al. On comparing partitions , 2015.
[54] Florian Metze,et al. How2: A Large-scale Dataset for Multimodal Language Understanding , 2018, NIPS 2018.