Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g., actions, objects, scenes, sentiment) while giving us new insight into what people find important or necessary when summarizing specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain. To address this, we present the Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events. We collect our descriptions as audio recordings to ensure that they remain as natural and concise as possible while allowing us to scale to the size of a large classification dataset. To utilize our proposed dataset, we present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and evaluate our models on video/caption retrieval across multiple datasets. We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.

http://moments.csail.mit.edu/spoken.html
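To make the kind of objective referenced above concrete, the sketch below shows a cross-modal contrastive hinge loss for video/caption retrieval in which the margin is shifted by the mean similarity of the in-batch negatives. This is a minimal illustrative sketch only: the function name, the `base_margin` value, and the exact adaptation rule are assumptions for demonstration and are not the paper's precise Adaptive Mean Margin formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_retrieval_loss(video_emb, caption_emb, base_margin=0.001):
    # Illustrative sketch: hinge-style cross-modal contrastive loss whose
    # margin is shifted by the mean in-batch negative similarity. The name,
    # base_margin value, and adaptation rule are assumptions; the paper's
    # AMM formulation may differ.
    v = F.normalize(video_emb, dim=-1)      # (B, D) video embeddings
    c = F.normalize(caption_emb, dim=-1)    # (B, D) caption embeddings
    sim = v @ c.t()                         # (B, B) cosine similarities
    b = sim.size(0)
    eye = torch.eye(b, dtype=torch.bool, device=sim.device)

    pos = sim[eye].unsqueeze(1)                           # (B, 1) matched pairs
    neg_v2c = sim.masked_select(~eye).view(b, b - 1)      # video -> caption negatives
    neg_c2v = sim.t().masked_select(~eye).view(b, b - 1)  # caption -> video negatives

    def hinge(neg):
        # Margin grows when the in-batch negatives are, on average, harder.
        margin = base_margin + neg.mean(dim=1, keepdim=True)
        return F.relu(margin + neg - pos).mean()

    # Symmetric loss over both retrieval directions.
    return hinge(neg_v2c) + hinge(neg_c2v)
```

In a typical training loop, this loss would be applied to a batch of paired embeddings produced by the video and caption (audio or text) encoders, with matched pairs sharing a row index.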
