James Glass | Brian Kingsbury | Michael Picheny | Antonio Torralba | Samuel Thomas | Dhiraj Joshi | David Harwath | Kartik Audhkhasi | Rogerio Feris | Andrew Rouditchenko | Angie Boggust
[1] Justin Salamon,et al. Telling Left From Right: Learning Spatial Correspondence of Sight and Sound , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Andrew Zisserman,et al. Self-Supervised MultiModal Versatile Networks , 2020, NeurIPS.
[3] Kristen Grauman,et al. Co-Separating Sounds of Visual Objects , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[4] Rogério Schmidt Feris,et al. Learning to Separate Object Sounds by Watching Unlabeled Video , 2018, ECCV.
[5] Sandy Ritchie,et al. Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data , 2019, INTERSPEECH.
[6] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[7] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[9] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[10] Christopher Joseph Pal,et al. Movie Description , 2016, International Journal of Computer Vision.
[11] Yonghui Wu,et al. Exploring the Limits of Language Modeling , 2016, ArXiv.
[12] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[13] Martial Hebert,et al. Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.
[14] Mark Hasegawa-Johnson,et al. Multimodal Word Discovery and Retrieval with Phone Sequence and Image Concepts , 2019, INTERSPEECH.
[15] Ivan Laptev,et al. Learnable pooling with Context Gating for video classification , 2017, ArXiv.
[16] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.
[17] James R. Glass,et al. Deep multimodal semantic embeddings for speech and images , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[18] Michael Picheny,et al. Grounding Spoken Words in Unlabeled Video , 2019, CVPR Workshops.
[19] Emmanuel Dupoux,et al. Learning Words from Images and Speech , 2014.
[20] Bernard Ghanem,et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering , 2019, NeurIPS.
[21] Bolei Zhou,et al. Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.
[22] Mathew Monfort,et al. Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Chuang Gan,et al. The Sound of Motions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[25] Jason Baldridge,et al. Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval , 2021, Interspeech.
[26] Xuelong Li,et al. Deep Multimodal Clustering for Unsupervised Audiovisual Learning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Grzegorz Chrupala. Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques , 2021, ArXiv.
[28] James Glass,et al. Learning Words by Drawing Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Chuang Gan,et al. The Sound of Pixels , 2018, ECCV.
[30] James Glass,et al. Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech , 2020, ICLR.
[31] Nuno Vasconcelos,et al. Self-Supervised Generation of Spatial Audio for 360 Video , 2018, NIPS 2018.
[32] Gabriel Ilharco,et al. Large-Scale Representation Learning from Visually Grounded Untranscribed Speech , 2019, CoNLL.
[33] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[34] Bolei Zhou,et al. Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.
[35] Andrew Zisserman,et al. Objects that Sound , 2017, ECCV.
[36] Florian Metze,et al. How2: A Large-scale Dataset for Multimodal Language Understanding , 2018, NIPS 2018.
[37] Juan Carlos Niebles,et al. Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Chuang Gan,et al. Self-supervised Audio-visual Co-segmentation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[39] James R. Glass,et al. Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data , 2018, ArXiv.
[40] Gunhee Kim,et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval , 2018, ECCV.
[41] Juan Carlos Niebles,et al. Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[42] Yang Liu,et al. Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.
[43] Yansong Tang,et al. COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Luowei Zhou,et al. Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction , 2018, BMVC.
[45] Chuang Gan,et al. Self-Supervised Moving Vehicle Tracking With Stereo Sound , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[46] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[47] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[48] James R. Glass,et al. Learning modality-invariant representations for speech and images , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[49] Ivan Laptev,et al. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data , 2018, ArXiv.
[50] James R. Glass,et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.
[51] Kristen Grauman,et al. 2.5D Visual Sound , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[53] Herman Arnold Engelbrecht,et al. Multimodal One-shot Learning of Speech and Images , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[54] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[55] Yueting Zhuang,et al. Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[56] Ivan Laptev,et al. Cross-Task Weakly Supervised Learning From Instructional Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[57] Andrew Zisserman,et al. QuerYD: A video dataset with high-quality textual and audio narrations , 2020, ArXiv.
[58] Mark Hasegawa-Johnson,et al. Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[59] Heng Wang,et al. Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[60] Dima Damen,et al. Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[61] Hilde Kuehne,et al. Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data , 2019, ArXiv.
[62] Mirjam Ernestus,et al. Language learning using Speech to Image retrieval , 2019, INTERSPEECH.
[63] Masood S. Mortazavi. Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks , 2020, ArXiv.
[64] James R. Glass,et al. Learning Word-Like Units from Joint Audio-Visual Analysis , 2017, ACL.
[65] Fadime Sener,et al. Unsupervised Learning and Segmentation of Complex Activities from Video , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[66] Tae-Hyun Oh,et al. On Learning Associations of Faces and Voices , 2018, ACCV.
[67] Ming-Hsuan Yang,et al. Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[68] Andrew Zisserman,et al. Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[69] Ivan Laptev,et al. Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[70] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.
[71] Karen Livescu,et al. Semantic Query-by-example Speech Search Using Visual Grounding , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[72] Florian Metze,et al. Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[73] Rami Ben-Ari,et al. Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning , 2020, AAAI.
[74] Grzegorz Chrupala,et al. Representations of language in a model of visually grounded speech signal , 2017, ACL.
[75] Gregory Shakhnarovich,et al. Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[76] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.
[77] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[78] Lorenzo Torresani,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.
[79] James R. Glass,et al. Unsupervised Learning of Spoken Language with Visual Context , 2016, NIPS.
[80] Michael Roth,et al. Visually grounded cross-lingual keyword spotting in speech , 2018, SLTU.
[81] Yutaka Satoh,et al. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[82] Florian Metze,et al. Learning from Multiview Correlations in Open-domain Videos , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[83] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[84] Chen Fang,et al. Visual to Sound: Generating Natural Sound for Videos in the Wild , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[85] Cordelia Schmid,et al. Learning Video Representations using Contrastive Bidirectional Transformer , 2019.
[86] Mark Hasegawa-Johnson,et al. A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions , 2020, INTERSPEECH.
[87] Olivier Rosec,et al. SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set , 2017, ArXiv.
[88] Cordelia Schmid,et al. Contrastive Bidirectional Transformer for Temporal Representation Learning , 2019, ArXiv.
[89] Efstratios Gavves,et al. Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[90] Andrew Zisserman,et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[91] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[92] Amit K. Roy-Chowdhury,et al. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.
[93] Andrew Owens,et al. Ambient Sound Provides Supervision for Visual Learning , 2016, ECCV.