Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions

We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time (MIT) dataset. For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset comprises annotations for 57,177 audiovisual videos, each independently evaluated by 3 of the 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes with 60 videos each (960 videos in total). We also offer two sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We then examined whether the AVMIT annotations and feature embeddings improve performance on audiovisual event recognition. A series of six Recurrent Neural Networks (RNNs) was trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. For all RNNs, top-1 accuracy increased by 2.71-5.94% when training exclusively on audiovisual events, a benefit that outweighed even a three-fold increase in training data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, particularly for research questions in which audiovisual correspondence is of critical importance.
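
The pre-computed embeddings are intended to be fed directly into lightweight sequence models such as the RNNs described above. The sketch below illustrates one possible setup: a small Keras GRU classifier over concatenated audio and visual embedding sequences for the 16-class test-set taxonomy. The number of time steps per clip, the per-step embedding dimensions (1024-d YamNet-style audio, 1280-d EfficientNetB0-style visual), the concatenation-based fusion, and the GRU architecture itself are illustrative assumptions, not the exact AVMIT release format or the paper's training configuration.

# Minimal sketch of an audiovisual RNN classifier over pre-computed embeddings.
# Shapes and the fusion strategy are assumptions for illustration only.
import numpy as np
import tensorflow as tf

T, AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 10, 1024, 1280, 16  # assumed dimensions

audio_in = tf.keras.Input(shape=(T, AUDIO_DIM), name="audio_embeddings")
visual_in = tf.keras.Input(shape=(T, VISUAL_DIM), name="visual_embeddings")

# Fuse modalities per time step, then summarise the sequence with a GRU.
fused = tf.keras.layers.Concatenate(axis=-1)([audio_in, visual_in])
hidden = tf.keras.layers.GRU(256)(fused)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

model = tf.keras.Model([audio_in, visual_in], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stand-in random data; in practice these would be the released AVMIT embeddings.
x_audio = np.random.randn(32, T, AUDIO_DIM).astype("float32")
x_visual = np.random.randn(32, T, VISUAL_DIM).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=(32,))

model.fit([x_audio, x_visual], y, epochs=1, batch_size=8)

Swapping the GRU for an LSTM, or concatenation for late fusion of per-modality recurrent states, would follow the same pattern; only the layer definitions between the inputs and the softmax would change.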
