Grounding Spoken Words in Unlabeled Video

In this paper, we explore deep learning models that learn joint multi-modal embeddings from videos in which the audio and visual streams are loosely synchronized. Specifically, we consider cooking-show videos from the YouCook2 dataset and a subset of the YouTube-8M dataset. We introduce varying levels of supervision to guide the sampling of audio-visual pairs used to train the models: (1) a fully unsupervised approach that samples audio-visual segments uniformly from an entire video, and (2) a weakly supervised approach that samples segments using off-the-shelf automatic speech recognition and visual recognition systems. Although these models are preliminary, they learn cross-modal correlations even without supervision, and weak supervision yields substantially stronger cross-modal learning. A sketch of the two sampling regimes follows.
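The following is a minimal, hypothetical sketch of the two pair-sampling regimes described above, not the paper's actual pipeline. Names such as `sample_uniform`, `sample_weakly_supervised`, the 5-second segment length, and the word/label matching heuristic are illustrative assumptions.

```python
# Illustrative sketch (assumed, not from the paper) of sampling audio-visual
# segment pairs under the two supervision regimes described in the abstract.
import random
from typing import List, Tuple

SEGMENT_LEN = 5.0  # assumed segment length in seconds


def sample_uniform(video_duration: float, n_pairs: int,
                   segment_len: float = SEGMENT_LEN) -> List[Tuple[float, float]]:
    """Fully unsupervised: draw co-occurring audio-visual segments uniformly
    at random over the whole video; audio and frames share the same span."""
    pairs = []
    for _ in range(n_pairs):
        start = random.uniform(0.0, max(video_duration - segment_len, 0.0))
        pairs.append((start, start + segment_len))
    return pairs


def sample_weakly_supervised(asr_segments: List[Tuple[float, float, str]],
                             visual_labels: List[Tuple[float, float, str]],
                             n_pairs: int) -> List[Tuple[float, float]]:
    """Weakly supervised: keep only spans where a word from an off-the-shelf
    ASR transcript matches a label from a visual recognizer on an overlapping
    span (a stand-in heuristic for the paper's supervision signal)."""
    candidates = []
    for a_start, a_end, word in asr_segments:
        for v_start, v_end, label in visual_labels:
            overlaps = a_start < v_end and v_start < a_end
            if overlaps and word.lower() == label.lower():
                candidates.append((min(a_start, v_start), max(a_end, v_end)))
    random.shuffle(candidates)
    return candidates[:n_pairs]


if __name__ == "__main__":
    # Toy usage: one cooking video with a few made-up ASR and visual detections.
    print(sample_uniform(video_duration=300.0, n_pairs=3))
    asr = [(12.0, 14.0, "onion"), (40.0, 42.0, "stir")]
    vis = [(11.0, 16.0, "onion"), (90.0, 95.0, "pan")]
    print(sample_weakly_supervised(asr, vis, n_pairs=2))
```

Either sampler yields time-aligned audio-visual segment pairs that could then feed a cross-modal embedding model; the weak supervision only changes which pairs are selected, not the training objective.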
