Ian Palmer | Andrew Rouditchenko | Andrei Barbu | Boris Katz | James R. Glass
[1] Ivan Laptev, et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[2] James R. Glass, et al. Deep Multimodal Semantic Embeddings for Speech and Images , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[3] Mirjam Ernestus, et al. Language Learning Using Speech to Image Retrieval , 2019, INTERSPEECH.
[4] Bolei Zhou, et al. Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Alan L. Yuille, et al. Object Recognition with and without Objects , 2016, IJCAI.
[6] James R. Glass, et al. Learning Word-Like Units from Joint Audio-Visual Analysis , 2017, ACL.
[7] James Glass, et al. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos , 2021, Interspeech 2021.
[8] Michael Picheny, et al. Grounding Spoken Words in Unlabeled Video , 2019, CVPR Workshops.
[9] Boris Katz, et al. ObjectNet: A Large-Scale Bias-Controlled Dataset for Pushing the Limits of Object Recognition Models , 2019, NeurIPS.
[10] James R. Glass, et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.
[11] James R. Glass, et al. Text-Free Image-to-Speech Synthesis Using Learned Segmental Units , 2020, ACL.
[12] Jason Baldridge, et al. Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval , 2021, Interspeech.
[13] Cyrus Rashtchian, et al. Collecting Image Annotations Using Amazon's Mechanical Turk , 2010, Mturk@HLT-NAACL.
[14] Jordi Pont-Tuset, et al. Connecting Vision and Language with Localized Narratives , 2019, ECCV.
[15] Florian Metze, et al. How2: A Large-Scale Dataset for Multimodal Language Understanding , 2018, NIPS 2018.
[16] James Glass, et al. Learning Words by Drawing Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Aren Jansen, et al. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[18] Olivier Rosec, et al. SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set , 2017, ArXiv.
[19] Kunio Kashino, et al. Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[21] Bolei Zhou, et al. Learning Deep Features for Scene Recognition Using Places Database , 2014, NIPS.
[22] Mathew Monfort, et al. Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Thomas Wolf, et al. HuggingFace's Transformers: State-of-the-Art Natural Language Processing , 2019, ArXiv.
[24] Li Fei-Fei, et al. ImageNet: A Large-Scale Hierarchical Image Database , 2009, CVPR.
[25] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[26] James R. Glass, et al. Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[27] Andrew Zisserman, et al. VGGSound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[28] James R. Glass, et al. Unsupervised Learning of Spoken Language with Visual Context , 2016, NIPS.
[29] Laurent Besacier, et al. Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[30] James Glass, et al. Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech , 2020, ICLR.
[31] Gabriel Ilharco, et al. Large-Scale Representation Learning from Visually Grounded Untranscribed Speech , 2019, CoNLL.
[32] Grzegorz Chrupała, et al. Representations of Language in a Model of Visually Grounded Speech Signal , 2017, ACL.
[33] Andrew Zisserman, et al. QuerYD: A Video Dataset with High-Quality Textual and Audio Narrations , 2020, ArXiv.
[34] Mark Hasegawa-Johnson, et al. Align or Attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).