AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Current methods for learning visually grounded language from videos often rely on time-consuming and expensive data collection, such as human annotated textual summaries or machine generated automatic speech recognition transcripts. In this work, we introduce Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. We circumvent the need for annotation and instead learn audio-visual language representations directly from randomly segmented video clips and their raw audio waveforms. We train AVLnet on publicly available instructional videos and evaluate our model on video clip and language retrieval tasks on three video datasets. Our proposed model outperforms several state-of-the-art text-video baselines by up to 11.8% in a video clip retrieval task, despite operating on the raw audio instead of manually annotated text captions. Further, we show AVLnet is capable of integrating textual information, increasing its modularity and improving performance by up to 20.3% on the video clip retrieval task. Finally, we perform analysis of AVLnet's learned representations, showing our model has learned to relate visual objects with salient words and natural sounds.

[1]  Grzegorz Chrupała Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques , 2021, J. Artif. Intell. Res..

[2]  Mark Hasegawa-Johnson,et al.  Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Mathew Monfort,et al.  Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Jason Baldridge,et al.  Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval , 2021, Interspeech.

[5]  João F. Henriques,et al.  QUERYD: A Video Dataset with High-Quality Text and Audio Narrations , 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Rami Ben-Ari,et al.  Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning , 2020, AAAI.

[7]  Masood S. Mortazavi Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks , 2020, INTERSPEECH.

[8]  Mark Hasegawa-Johnson,et al.  A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions , 2020, INTERSPEECH.

[9]  Andrew Zisserman,et al.  Self-Supervised MultiModal Versatile Networks , 2020, NeurIPS.

[10]  Justin Salamon,et al.  Telling Left From Right: Learning Spatial Correspondence of Sight and Sound , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[12]  Andrew Zisserman,et al.  End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Bernard Ghanem,et al.  Self-Supervised Learning by Cross-Modal Audio-Video Clustering , 2019, NeurIPS.

[14]  James R. Glass,et al.  Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech , 2019, ICLR.

[15]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[16]  Chuang Gan,et al.  Self-Supervised Moving Vehicle Tracking With Stereo Sound , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Cordelia Schmid,et al.  Learning Video Representations using Contrastive Bidirectional Transformer , 2019 .

[18]  Gabriel Ilharco,et al.  Large-Scale Representation Learning from Visually Grounded Untranscribed Speech , 2019, CoNLL.

[19]  Sandy Ritchie,et al.  Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data , 2019, INTERSPEECH.

[20]  Mark Hasegawa-Johnson,et al.  Multimodal Word Discovery and Retrieval with Phone Sequence and Image Concepts , 2019, INTERSPEECH.

[21]  Mirjam Ernestus,et al.  Language learning using Speech to Image retrieval , 2019, INTERSPEECH.

[22]  Dima Damen,et al.  Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Yang Liu,et al.  Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.

[24]  Cordelia Schmid,et al.  Contrastive Bidirectional Transformer for Temporal Representation Learning , 2019, ArXiv.

[25]  Ivan Laptev,et al.  HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Hilde Kuehne,et al.  Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data , 2019, ArXiv.

[27]  Yueting Zhuang,et al.  Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  James Glass,et al.  Learning Words by Drawing Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Heng Wang,et al.  Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Chuang Gan,et al.  Self-supervised Audio-visual Co-segmentation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Kristen Grauman,et al.  Co-Separating Sounds of Visual Objects , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Karen Livescu,et al.  Semantic Query-by-example Speech Search Using Visual Grounding , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Chuang Gan,et al.  The Sound of Motions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Cordelia Schmid,et al.  VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Ivan Laptev,et al.  Cross-Task Weakly Supervised Learning From Instructional Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yansong Tang,et al.  COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Kristen Grauman,et al.  2.5D Visual Sound , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Florian Metze,et al.  Learning from Multiview Correlations in Open-domain Videos , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Herman Arnold Engelbrecht,et al.  Multimodal One-shot Learning of Speech and Images , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Xuelong Li,et al.  Deep Multimodal Clustering for Unsupervised Audiovisual Learning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Gregory Shakhnarovich,et al.  Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[42]  Michael Picheny,et al.  Grounding Spoken Words in Unlabeled Video , 2019, CVPR Workshops.

[43]  Florian Metze,et al.  How2: A Large-scale Dataset for Multimodal Language Understanding , 2018, NIPS 2018.

[44]  Nuno Vasconcelos,et al.  Self-Supervised Generation of Spatial Audio for 360 Video , 2018, NIPS 2018.

[45]  Gunhee Kim,et al.  A Joint Sequence Fusion Model for Video Question Answering and Retrieval , 2018, ECCV.

[46]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[47]  Michael Roth,et al.  Visually grounded cross-lingual keyword spotting in speech , 2018, SLTU.

[48]  Amit K. Roy-Chowdhury,et al.  Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.

[49]  Lorenzo Torresani,et al.  Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.

[50]  Juan Carlos Niebles,et al.  Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  James R. Glass,et al.  Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data , 2018, ArXiv.

[52]  Tae-Hyun Oh,et al.  On Learning Associations of Faces and Voices , 2018, ACCV.

[53]  Luowei Zhou,et al.  Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction , 2018, BMVC.

[54]  Andrew Owens,et al.  Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.

[55]  Chuang Gan,et al.  The Sound of Pixels , 2018, ECCV.

[56]  Ivan Laptev,et al.  Learning a Text-Video Embedding from Incomplete and Heterogeneous Data , 2018, ArXiv.

[57]  Rogério Schmidt Feris,et al.  Learning to Separate Object Sounds by Watching Unlabeled Video , 2018, ECCV.

[58]  James R. Glass,et al.  Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.

[59]  Andrew Zisserman,et al.  Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60]  Fadime Sener,et al.  Unsupervised Learning and Segmentation of Complex Activities from Video , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61]  Florian Metze,et al.  Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[62]  Andrew Zisserman,et al.  Objects that Sound , 2017, ECCV.

[63]  Chen Fang,et al.  Visual to Sound: Generating Natural Sound for Videos in the Wild , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[65]  Chenliang Xu,et al.  Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.

[66]  James R. Glass,et al.  Learning modality-invariant representations for speech and images , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[67]  Ming-Hsuan Yang,et al.  Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[68]  Olivier Rosec,et al.  SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set , 2017, INTERSPEECH 2017.

[69]  Ivan Laptev,et al.  Learnable pooling with Context Gating for video classification , 2017, ArXiv.

[70]  Andrew Zisserman,et al.  Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[71]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Juan Carlos Niebles,et al.  Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Grzegorz Chrupala,et al.  Representations of language in a model of visually grounded speech signal , 2017, ACL.

[74]  James R. Glass,et al.  Learning Word-Like Units from Joint Audio-Visual Analysis , 2017, ACL.

[75]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Antonio Torralba,et al.  SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.

[77]  Andrew Owens,et al.  Ambient Sound Provides Supervision for Visual Learning , 2016, ECCV.

[78]  Christopher Joseph Pal,et al.  Movie Description , 2016, International Journal of Computer Vision.

[79]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[80]  Yonghui Wu,et al.  Exploring the Limits of Language Modeling , 2016, ArXiv.

[81]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[82]  Andrew Owens,et al.  Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[84]  Ivan Laptev,et al.  Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  James R. Glass,et al.  Unsupervised Learning of Spoken Language with Visual Context , 2016, NIPS.

[86]  James R. Glass,et al.  Deep multimodal semantic embeddings for speech and images , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[87]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[88]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[89]  Emmanuel Dupoux,et al.  Learning Words from Images and Speech , 2014 .

[90]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[91]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[92]  Aapo Hyvärinen,et al.  Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.

[93]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.